1.Case Study Introduction¶

This case study aims to give you an idea of applying EDA in a real business scenario. You will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

1.1 Business Understanding¶

The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it to their advantage by becoming a defaulter. Suppose you work for a consumer finance company which specializes in lending various types of loans to urban customers. You have to use EDA to analyze the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected. When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

  1. If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

  2. If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

    • The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample

    • All other cases: All other cases when the payment is paid on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company):

    • Approved: The Company has approved loan Application

    • Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client he received worse pricing which he did not want.

    • Refused: The company had rejected the loan (because the client does not meet their requirements etc.).

    • Unused offer: Loan has been cancelled by the client but on different stages of the process.

1.2 Business Objectives¶

This case study aims to identify patterns which indicate if a client has difficulty paying their instalments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study. In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

1.3 Data Understanding¶

Our dataset has 3 files as explained below:

  1. application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.
  2. previous_application.csv' contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer.
  3. columns_description.csv' is data dictionary which describes the meaning of the variables.

2. Data Exploration¶

2.1 Read the data file¶

We will first read the data file which includes importing the required libraries, loading the dataset and viewing the data file.

In [1]:
# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [63]:
# filter out the warnings
import warnings
warnings.filterwarnings('ignore')
In [64]:
# setting maximum rows & columns display size to 200 for better visibility of data 
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
In [3]:
# load dataset 
app_df =  pd.read_csv('application_data.csv')
In [5]:
# viewing first 5 rows
app_df.head(5)
Out[5]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
In [6]:
#viewing last 5 rows
app_df.tail(5)
Out[6]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
307506 456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 225000.0 Unaccompanied Working Secondary / secondary special Separated With parents 0.032561 -9327 -236 -8456.0 -1982 NaN 1 1 0 1 0 0 Sales staff 1.0 1 1 THURSDAY 15 0 0 0 0 0 0 Services 0.145570 0.681632 NaN 0.2021 0.0887 0.9876 0.8300 0.0202 0.22 0.1034 0.6042 0.2708 0.0594 0.1484 0.1965 0.0753 0.1095 0.1008 0.0172 0.9782 0.7125 0.0172 0.0806 0.0345 0.4583 0.0417 0.0094 0.0882 0.0853 0.0 0.0125 0.2040 0.0887 0.9876 0.8323 0.0203 0.22 0.1034 0.6042 0.2708 0.0605 0.1509 0.2001 0.0757 0.1118 reg oper account block of flats 0.2898 Stone, brick No 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
307507 456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary / secondary special Widow House / apartment 0.025164 -20775 365243 -4388.0 -4090 NaN 1 0 0 1 1 0 NaN 1.0 2 2 MONDAY 8 0 0 0 0 0 0 XNA NaN 0.115992 NaN 0.0247 0.0435 0.9727 0.6260 0.0022 0.00 0.1034 0.0833 0.1250 0.0579 0.0202 0.0257 0.0000 0.0000 0.0252 0.0451 0.9727 0.6406 0.0022 0.0000 0.1034 0.0833 0.1250 0.0592 0.0220 0.0267 0.0 0.0000 0.0250 0.0435 0.9727 0.6310 0.0022 0.00 0.1034 0.0833 0.1250 0.0589 0.0205 0.0261 0.0000 0.0000 reg oper account block of flats 0.0214 Stone, brick No 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
307508 456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 585000.0 Unaccompanied Working Higher education Separated House / apartment 0.005002 -14966 -7921 -6737.0 -5150 NaN 1 1 0 1 0 1 Managers 1.0 3 3 THURSDAY 9 0 0 0 0 1 1 School 0.744026 0.535722 0.218859 0.1031 0.0862 0.9816 0.7484 0.0123 0.00 0.2069 0.1667 0.2083 NaN 0.0841 0.9279 0.0000 0.0000 0.1050 0.0894 0.9816 0.7583 0.0124 0.0000 0.2069 0.1667 0.2083 NaN 0.0918 0.9667 0.0 0.0000 0.1041 0.0862 0.9816 0.7518 0.0124 0.00 0.2069 0.1667 0.2083 NaN 0.0855 0.9445 0.0000 0.0000 reg oper account block of flats 0.7970 Panel No 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0
307509 456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 319500.0 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.005313 -11961 -4786 -2562.0 -931 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 9 0 0 0 1 1 0 Business Entity Type 1 NaN 0.514163 0.661024 0.0124 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0061 NaN NaN 0.0126 NaN 0.9772 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0063 NaN NaN 0.0125 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0062 NaN NaN NaN block of flats 0.0086 Stone, brick No 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
307510 456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 675000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.046220 -16856 -1262 -5128.0 -410 NaN 1 1 1 1 1 0 Laborers 2.0 1 1 THURSDAY 20 0 0 0 0 1 1 Business Entity Type 3 0.734460 0.708569 0.113922 0.0742 0.0526 0.9881 NaN 0.0176 0.08 0.0690 0.3750 NaN NaN NaN 0.0791 NaN 0.0000 0.0756 0.0546 0.9881 NaN 0.0178 0.0806 0.0690 0.3750 NaN NaN NaN 0.0824 NaN 0.0000 0.0749 0.0526 0.9881 NaN 0.0177 0.08 0.0690 0.3750 NaN NaN NaN 0.0805 NaN 0.0000 NaN block of flats 0.0718 Panel No 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0

2.2 Inspect the data frame¶

Next we will check the various attributes like shape (rows and cols) & datatypes

In [7]:
app_df.shape
Out[7]:
(307511, 122)
In [8]:
app_df.columns.values
Out[8]:
array(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE',
       'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1',
       'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG',
       'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG',
       'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
       'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE',
       'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE',
       'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE',
       'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
       'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI',
       'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI',
       'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI',
       'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE',
       'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype=object)
In [9]:
# checking the data types of all the columns
app_df.dtypes
Out[9]:
SK_ID_CURR                        int64
TARGET                            int64
NAME_CONTRACT_TYPE               object
CODE_GENDER                      object
FLAG_OWN_CAR                     object
FLAG_OWN_REALTY                  object
CNT_CHILDREN                      int64
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                  object
NAME_INCOME_TYPE                 object
NAME_EDUCATION_TYPE              object
NAME_FAMILY_STATUS               object
NAME_HOUSING_TYPE                object
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                        int64
DAYS_EMPLOYED                     int64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH                   int64
OWN_CAR_AGE                     float64
FLAG_MOBIL                        int64
FLAG_EMP_PHONE                    int64
FLAG_WORK_PHONE                   int64
FLAG_CONT_MOBILE                  int64
FLAG_PHONE                        int64
FLAG_EMAIL                        int64
OCCUPATION_TYPE                  object
CNT_FAM_MEMBERS                 float64
REGION_RATING_CLIENT              int64
REGION_RATING_CLIENT_W_CITY       int64
WEEKDAY_APPR_PROCESS_START       object
HOUR_APPR_PROCESS_START           int64
REG_REGION_NOT_LIVE_REGION        int64
REG_REGION_NOT_WORK_REGION        int64
LIVE_REGION_NOT_WORK_REGION       int64
REG_CITY_NOT_LIVE_CITY            int64
REG_CITY_NOT_WORK_CITY            int64
LIVE_CITY_NOT_WORK_CITY           int64
ORGANIZATION_TYPE                object
EXT_SOURCE_1                    float64
EXT_SOURCE_2                    float64
EXT_SOURCE_3                    float64
APARTMENTS_AVG                  float64
BASEMENTAREA_AVG                float64
YEARS_BEGINEXPLUATATION_AVG     float64
YEARS_BUILD_AVG                 float64
COMMONAREA_AVG                  float64
ELEVATORS_AVG                   float64
ENTRANCES_AVG                   float64
FLOORSMAX_AVG                   float64
FLOORSMIN_AVG                   float64
LANDAREA_AVG                    float64
LIVINGAPARTMENTS_AVG            float64
LIVINGAREA_AVG                  float64
NONLIVINGAPARTMENTS_AVG         float64
NONLIVINGAREA_AVG               float64
APARTMENTS_MODE                 float64
BASEMENTAREA_MODE               float64
YEARS_BEGINEXPLUATATION_MODE    float64
YEARS_BUILD_MODE                float64
COMMONAREA_MODE                 float64
ELEVATORS_MODE                  float64
ENTRANCES_MODE                  float64
FLOORSMAX_MODE                  float64
FLOORSMIN_MODE                  float64
LANDAREA_MODE                   float64
LIVINGAPARTMENTS_MODE           float64
LIVINGAREA_MODE                 float64
NONLIVINGAPARTMENTS_MODE        float64
NONLIVINGAREA_MODE              float64
APARTMENTS_MEDI                 float64
BASEMENTAREA_MEDI               float64
YEARS_BEGINEXPLUATATION_MEDI    float64
YEARS_BUILD_MEDI                float64
COMMONAREA_MEDI                 float64
ELEVATORS_MEDI                  float64
ENTRANCES_MEDI                  float64
FLOORSMAX_MEDI                  float64
FLOORSMIN_MEDI                  float64
LANDAREA_MEDI                   float64
LIVINGAPARTMENTS_MEDI           float64
LIVINGAREA_MEDI                 float64
NONLIVINGAPARTMENTS_MEDI        float64
NONLIVINGAREA_MEDI              float64
FONDKAPREMONT_MODE               object
HOUSETYPE_MODE                   object
TOTALAREA_MODE                  float64
WALLSMATERIAL_MODE               object
EMERGENCYSTATE_MODE              object
OBS_30_CNT_SOCIAL_CIRCLE        float64
DEF_30_CNT_SOCIAL_CIRCLE        float64
OBS_60_CNT_SOCIAL_CIRCLE        float64
DEF_60_CNT_SOCIAL_CIRCLE        float64
DAYS_LAST_PHONE_CHANGE          float64
FLAG_DOCUMENT_2                   int64
FLAG_DOCUMENT_3                   int64
FLAG_DOCUMENT_4                   int64
FLAG_DOCUMENT_5                   int64
FLAG_DOCUMENT_6                   int64
FLAG_DOCUMENT_7                   int64
FLAG_DOCUMENT_8                   int64
FLAG_DOCUMENT_9                   int64
FLAG_DOCUMENT_10                  int64
FLAG_DOCUMENT_11                  int64
FLAG_DOCUMENT_12                  int64
FLAG_DOCUMENT_13                  int64
FLAG_DOCUMENT_14                  int64
FLAG_DOCUMENT_15                  int64
FLAG_DOCUMENT_16                  int64
FLAG_DOCUMENT_17                  int64
FLAG_DOCUMENT_18                  int64
FLAG_DOCUMENT_19                  int64
FLAG_DOCUMENT_20                  int64
FLAG_DOCUMENT_21                  int64
AMT_REQ_CREDIT_BUREAU_HOUR      float64
AMT_REQ_CREDIT_BUREAU_DAY       float64
AMT_REQ_CREDIT_BUREAU_WEEK      float64
AMT_REQ_CREDIT_BUREAU_MON       float64
AMT_REQ_CREDIT_BUREAU_QRT       float64
AMT_REQ_CREDIT_BUREAU_YEAR      float64
dtype: object
In [10]:
# check the descriptive statistics of numerical variables
#convert to float for better visibility
pd.options.display.float_format = '{:,.2f}'.format
app_df.describe()
Out[10]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,499.00 307,233.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 104,582.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,509.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 134,133.00 306,851.00 246,546.00 151,450.00 127,568.00 157,504.00 103,023.00 92,646.00 143,620.00 152,683.00 154,491.00 98,869.00 124,921.00 97,312.00 153,161.00 93,997.00 137,829.00 151,450.00 127,568.00 157,504.00 103,023.00 92,646.00 143,620.00 152,683.00 154,491.00 98,869.00 124,921.00 97,312.00 153,161.00 93,997.00 137,829.00 151,450.00 127,568.00 157,504.00 103,023.00 92,646.00 143,620.00 152,683.00 154,491.00 98,869.00 124,921.00 97,312.00 153,161.00 93,997.00 137,829.00 159,080.00 306,490.00 306,490.00 306,490.00 306,490.00 307,510.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 307,511.00 265,992.00 265,992.00 265,992.00 265,992.00 265,992.00 265,992.00
mean 278,180.52 0.08 0.42 168,797.92 599,026.00 27,108.57 538,396.21 0.02 -16,037.00 63,815.05 -4,986.12 -2,994.20 12.06 1.00 0.82 0.20 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102,790.18 0.27 0.72 237,123.15 402,490.78 14,493.74 369,446.46 0.01 4,363.99 141,275.77 3,522.89 1,509.45 11.94 0.00 0.38 0.40 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.00 0.06 0.00 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100,002.00 0.00 0.00 25,650.00 45,000.00 1,615.50 40,500.00 0.00 -25,229.00 -17,912.00 -24,672.00 -7,197.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4,292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189,145.50 0.00 0.00 112,500.00 270,000.00 16,524.00 238,500.00 0.01 -19,682.00 -2,760.00 -7,479.50 -4,299.00 5.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1,570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278,202.00 0.00 0.00 147,150.00 513,531.00 24,903.00 450,000.00 0.02 -15,750.00 -1,213.00 -4,504.00 -3,254.00 9.00 1.00 1.00 0.00 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367,142.50 0.00 1.00 202,500.00 808,650.00 34,596.00 679,500.00 0.03 -12,413.00 -289.00 -2,010.00 -1,720.00 15.00 1.00 1.00 0.00 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456,255.00 1.00 19.00 117,000,000.00 4,050,000.00 258,025.50 4,050,000.00 0.07 -7,489.00 365,243.00 0.00 0.00 91.00 1.00 1.00 1.00 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
  • Around 8% of clients defaulted (TARGET = 1) → data is imbalanced.
  • Average annual income ≈ 168k, but some extremely high values (outliers).
  • Average loan amount ≈ 600k; several high-value loans above 3M.
  • Typical applicant age ≈ 44 years old.
  • Majority have 0–1 child; very few with >5.
In [11]:
app_df['TARGET'].value_counts()
Out[11]:
TARGET
0    282686
1     24825
Name: count, dtype: int64
In [12]:
app_df['TARGET'].value_counts()/len(app_df) * 100
Out[12]:
TARGET
0   91.93
1    8.07
Name: count, dtype: float64
In [13]:
#set figure size
plt.figure(figsize=(4,3))
sns.countplot(data=app_df, x='TARGET')
plt.title('Count of TARGET variable per value')
plt.xlabel('Target Variable')
plt.ylabel('Count')
plt.show()
No description has been provided for this image
In [14]:
#concise summary of the dataframe to view data types of all columns & null values 
#to display all columns use verbose = true
app_df.info(verbose=True, memory_usage=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    SK_ID_CURR                    int64  
 1    TARGET                        int64  
 2    NAME_CONTRACT_TYPE            object 
 3    CODE_GENDER                   object 
 4    FLAG_OWN_CAR                  object 
 5    FLAG_OWN_REALTY               object 
 6    CNT_CHILDREN                  int64  
 7    AMT_INCOME_TOTAL              float64
 8    AMT_CREDIT                    float64
 9    AMT_ANNUITY                   float64
 10   AMT_GOODS_PRICE               float64
 11   NAME_TYPE_SUITE               object 
 12   NAME_INCOME_TYPE              object 
 13   NAME_EDUCATION_TYPE           object 
 14   NAME_FAMILY_STATUS            object 
 15   NAME_HOUSING_TYPE             object 
 16   REGION_POPULATION_RELATIVE    float64
 17   DAYS_BIRTH                    int64  
 18   DAYS_EMPLOYED                 int64  
 19   DAYS_REGISTRATION             float64
 20   DAYS_ID_PUBLISH               int64  
 21   OWN_CAR_AGE                   float64
 22   FLAG_MOBIL                    int64  
 23   FLAG_EMP_PHONE                int64  
 24   FLAG_WORK_PHONE               int64  
 25   FLAG_CONT_MOBILE              int64  
 26   FLAG_PHONE                    int64  
 27   FLAG_EMAIL                    int64  
 28   OCCUPATION_TYPE               object 
 29   CNT_FAM_MEMBERS               float64
 30   REGION_RATING_CLIENT          int64  
 31   REGION_RATING_CLIENT_W_CITY   int64  
 32   WEEKDAY_APPR_PROCESS_START    object 
 33   HOUR_APPR_PROCESS_START       int64  
 34   REG_REGION_NOT_LIVE_REGION    int64  
 35   REG_REGION_NOT_WORK_REGION    int64  
 36   LIVE_REGION_NOT_WORK_REGION   int64  
 37   REG_CITY_NOT_LIVE_CITY        int64  
 38   REG_CITY_NOT_WORK_CITY        int64  
 39   LIVE_CITY_NOT_WORK_CITY       int64  
 40   ORGANIZATION_TYPE             object 
 41   EXT_SOURCE_1                  float64
 42   EXT_SOURCE_2                  float64
 43   EXT_SOURCE_3                  float64
 44   APARTMENTS_AVG                float64
 45   BASEMENTAREA_AVG              float64
 46   YEARS_BEGINEXPLUATATION_AVG   float64
 47   YEARS_BUILD_AVG               float64
 48   COMMONAREA_AVG                float64
 49   ELEVATORS_AVG                 float64
 50   ENTRANCES_AVG                 float64
 51   FLOORSMAX_AVG                 float64
 52   FLOORSMIN_AVG                 float64
 53   LANDAREA_AVG                  float64
 54   LIVINGAPARTMENTS_AVG          float64
 55   LIVINGAREA_AVG                float64
 56   NONLIVINGAPARTMENTS_AVG       float64
 57   NONLIVINGAREA_AVG             float64
 58   APARTMENTS_MODE               float64
 59   BASEMENTAREA_MODE             float64
 60   YEARS_BEGINEXPLUATATION_MODE  float64
 61   YEARS_BUILD_MODE              float64
 62   COMMONAREA_MODE               float64
 63   ELEVATORS_MODE                float64
 64   ENTRANCES_MODE                float64
 65   FLOORSMAX_MODE                float64
 66   FLOORSMIN_MODE                float64
 67   LANDAREA_MODE                 float64
 68   LIVINGAPARTMENTS_MODE         float64
 69   LIVINGAREA_MODE               float64
 70   NONLIVINGAPARTMENTS_MODE      float64
 71   NONLIVINGAREA_MODE            float64
 72   APARTMENTS_MEDI               float64
 73   BASEMENTAREA_MEDI             float64
 74   YEARS_BEGINEXPLUATATION_MEDI  float64
 75   YEARS_BUILD_MEDI              float64
 76   COMMONAREA_MEDI               float64
 77   ELEVATORS_MEDI                float64
 78   ENTRANCES_MEDI                float64
 79   FLOORSMAX_MEDI                float64
 80   FLOORSMIN_MEDI                float64
 81   LANDAREA_MEDI                 float64
 82   LIVINGAPARTMENTS_MEDI         float64
 83   LIVINGAREA_MEDI               float64
 84   NONLIVINGAPARTMENTS_MEDI      float64
 85   NONLIVINGAREA_MEDI            float64
 86   FONDKAPREMONT_MODE            object 
 87   HOUSETYPE_MODE                object 
 88   TOTALAREA_MODE                float64
 89   WALLSMATERIAL_MODE            object 
 90   EMERGENCYSTATE_MODE           object 
 91   OBS_30_CNT_SOCIAL_CIRCLE      float64
 92   DEF_30_CNT_SOCIAL_CIRCLE      float64
 93   OBS_60_CNT_SOCIAL_CIRCLE      float64
 94   DEF_60_CNT_SOCIAL_CIRCLE      float64
 95   DAYS_LAST_PHONE_CHANGE        float64
 96   FLAG_DOCUMENT_2               int64  
 97   FLAG_DOCUMENT_3               int64  
 98   FLAG_DOCUMENT_4               int64  
 99   FLAG_DOCUMENT_5               int64  
 100  FLAG_DOCUMENT_6               int64  
 101  FLAG_DOCUMENT_7               int64  
 102  FLAG_DOCUMENT_8               int64  
 103  FLAG_DOCUMENT_9               int64  
 104  FLAG_DOCUMENT_10              int64  
 105  FLAG_DOCUMENT_11              int64  
 106  FLAG_DOCUMENT_12              int64  
 107  FLAG_DOCUMENT_13              int64  
 108  FLAG_DOCUMENT_14              int64  
 109  FLAG_DOCUMENT_15              int64  
 110  FLAG_DOCUMENT_16              int64  
 111  FLAG_DOCUMENT_17              int64  
 112  FLAG_DOCUMENT_18              int64  
 113  FLAG_DOCUMENT_19              int64  
 114  FLAG_DOCUMENT_20              int64  
 115  FLAG_DOCUMENT_21              int64  
 116  AMT_REQ_CREDIT_BUREAU_HOUR    float64
 117  AMT_REQ_CREDIT_BUREAU_DAY     float64
 118  AMT_REQ_CREDIT_BUREAU_WEEK    float64
 119  AMT_REQ_CREDIT_BUREAU_MON     float64
 120  AMT_REQ_CREDIT_BUREAU_QRT     float64
 121  AMT_REQ_CREDIT_BUREAU_YEAR    float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
In [4]:
#Count null values per column
null_count = app_df.isnull().sum()
#Total number of rows
total_count = app_df.shape[0]
# Calculate percentage of nulls and round to 2 decimal places
null_pct = ((null_count / total_count) * 100).round(2)
#Filter only columns with missing values (makes plot cleaner)
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)

#Plot using pointplot
plt.figure(figsize=(16,5))
sns.pointplot(x=null_pct.index, y=null_pct.values)
plt.xticks(rotation=90)
plt.xlabel("Columns")
plt.ylabel("Missing Values (%)")
plt.title("Percentage of Null Values per Column")
plt.show()
No description has been provided for this image

3. Data Cleaning¶

3.1 Missing Value Imputation¶

In [5]:
##create a copy of base data for manipulation & processing 
new_app_df = app_df.copy()
In [17]:
#check the new dataframe
new_app_df.head(5)
Out[17]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202,500.00 406,597.50 24,700.50 351,000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.02 -9461 -637 -3,648.00 -2120 NaN 1 1 0 1 1 0 Laborers 1.00 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.08 0.26 0.14 0.02 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.63 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 0.03 0.04 0.97 0.62 0.01 0.00 0.07 0.08 0.12 0.04 0.02 0.02 0.00 0.00 reg oper account block of flats 0.01 Stone, brick No 2.00 2.00 2.00 2.00 -1,134.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 1.00
1 100003 0 Cash loans F N N 0 270,000.00 1,293,502.50 35,698.50 1,129,500.00 Family State servant Higher education Married House / apartment 0.00 -16765 -1188 -1,186.00 -291 NaN 1 1 0 1 1 0 Core staff 2.00 1 1 MONDAY 11 0 0 0 0 0 0 School 0.31 0.62 NaN 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.05 0.00 0.01 0.09 0.05 0.99 0.80 0.05 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.00 0.10 0.05 0.99 0.80 0.06 0.08 0.03 0.29 0.33 0.01 0.08 0.06 0.00 0.01 reg oper account block of flats 0.07 Block No 1.00 0.00 1.00 0.00 -828.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
2 100004 0 Revolving loans M Y Y 0 67,500.00 135,000.00 6,750.00 135,000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.01 -19046 -225 -4,260.00 -2531 26.00 1 1 1 1 1 0 Laborers 1.00 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.56 0.73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -815.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
3 100006 0 Cash loans F N Y 0 135,000.00 312,682.50 29,686.50 297,000.00 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.01 -19005 -3039 -9,833.00 -2437 NaN 1 1 0 1 0 0 Laborers 2.00 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.00 0.00 2.00 0.00 -617.00 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121,500.00 513,000.00 21,865.50 513,000.00 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.03 -19932 -3038 -4,311.00 -3458 NaN 1 1 0 1 0 0 Core staff 1.00 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.00 0.00 0.00 0.00 -1,106.00 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00
In [18]:
#check for missing value in percentage
null_pct.sort_values(ascending=True)
Out[18]:
AMT_GOODS_PRICE                 0.09
EXT_SOURCE_2                    0.21
DEF_30_CNT_SOCIAL_CIRCLE        0.33
OBS_30_CNT_SOCIAL_CIRCLE        0.33
OBS_60_CNT_SOCIAL_CIRCLE        0.33
DEF_60_CNT_SOCIAL_CIRCLE        0.33
NAME_TYPE_SUITE                 0.42
AMT_REQ_CREDIT_BUREAU_MON      13.50
AMT_REQ_CREDIT_BUREAU_WEEK     13.50
AMT_REQ_CREDIT_BUREAU_YEAR     13.50
AMT_REQ_CREDIT_BUREAU_HOUR     13.50
AMT_REQ_CREDIT_BUREAU_QRT      13.50
AMT_REQ_CREDIT_BUREAU_DAY      13.50
EXT_SOURCE_3                   19.83
OCCUPATION_TYPE                31.35
EMERGENCYSTATE_MODE            47.40
TOTALAREA_MODE                 48.27
YEARS_BEGINEXPLUATATION_MEDI   48.78
YEARS_BEGINEXPLUATATION_AVG    48.78
YEARS_BEGINEXPLUATATION_MODE   48.78
FLOORSMAX_MEDI                 49.76
FLOORSMAX_MODE                 49.76
FLOORSMAX_AVG                  49.76
HOUSETYPE_MODE                 50.18
LIVINGAREA_MODE                50.19
LIVINGAREA_MEDI                50.19
LIVINGAREA_AVG                 50.19
ENTRANCES_MODE                 50.35
ENTRANCES_MEDI                 50.35
ENTRANCES_AVG                  50.35
APARTMENTS_AVG                 50.75
APARTMENTS_MODE                50.75
APARTMENTS_MEDI                50.75
WALLSMATERIAL_MODE             50.84
ELEVATORS_AVG                  53.30
ELEVATORS_MODE                 53.30
ELEVATORS_MEDI                 53.30
NONLIVINGAREA_AVG              55.18
NONLIVINGAREA_MODE             55.18
NONLIVINGAREA_MEDI             55.18
EXT_SOURCE_1                   56.38
BASEMENTAREA_MEDI              58.52
BASEMENTAREA_AVG               58.52
BASEMENTAREA_MODE              58.52
LANDAREA_MODE                  59.38
LANDAREA_AVG                   59.38
LANDAREA_MEDI                  59.38
OWN_CAR_AGE                    65.99
YEARS_BUILD_MODE               66.50
YEARS_BUILD_AVG                66.50
YEARS_BUILD_MEDI               66.50
FLOORSMIN_AVG                  67.85
FLOORSMIN_MODE                 67.85
FLOORSMIN_MEDI                 67.85
LIVINGAPARTMENTS_MODE          68.35
LIVINGAPARTMENTS_AVG           68.35
LIVINGAPARTMENTS_MEDI          68.35
FONDKAPREMONT_MODE             68.39
NONLIVINGAPARTMENTS_MEDI       69.43
NONLIVINGAPARTMENTS_AVG        69.43
NONLIVINGAPARTMENTS_MODE       69.43
COMMONAREA_MEDI                69.87
COMMONAREA_AVG                 69.87
COMMONAREA_MODE                69.87
dtype: float64
In [6]:
# get ONLY column names with >50% missing
cols_50 = null_pct[null_pct >= 50].index.tolist()
print("Columns with >50% missing values:")
print(cols_50)
Columns with >50% missing values:
['COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'FONDKAPREMONT_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MEDI', 'FLOORSMIN_MEDI', 'FLOORSMIN_AVG', 'FLOORSMIN_MODE', 'YEARS_BUILD_MEDI', 'YEARS_BUILD_MODE', 'YEARS_BUILD_AVG', 'OWN_CAR_AGE', 'LANDAREA_MEDI', 'LANDAREA_MODE', 'LANDAREA_AVG', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI', 'EXT_SOURCE_1', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'NONLIVINGAREA_AVG', 'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'ELEVATORS_AVG', 'WALLSMATERIAL_MODE', 'APARTMENTS_AVG', 'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'ENTRANCES_AVG', 'ENTRANCES_MODE', 'ENTRANCES_MEDI', 'LIVINGAREA_AVG', 'LIVINGAREA_MODE', 'LIVINGAREA_MEDI', 'HOUSETYPE_MODE']
  • Since these columns have more than 50% null values, it is wise to drop them because when there is too little real data (e.x > 50% values missing), there is a higher biasness risk.
In [7]:
new_app_df.drop(cols_50, axis = 1, inplace = True)
new_app_df.shape
Out[7]:
(307511, 81)
  • We will also drop columns with more than 40% null values as they will also introduce biasness to our dataset
In [21]:
# get the column names with >40% missing
In [7]:
cols_between_40_and_50 = null_pct[(null_pct > 40) & (null_pct < 50)].index.tolist()
cols_between_40_and_50
Out[7]:
['FLOORSMAX_AVG',
 'FLOORSMAX_MEDI',
 'FLOORSMAX_MODE',
 'YEARS_BEGINEXPLUATATION_AVG',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'TOTALAREA_MODE',
 'EMERGENCYSTATE_MODE']
In [8]:
new_app_df.drop(cols_between_40_and_50, axis=1, inplace=True)
new_app_df.shape
Out[8]:
(307511, 114)
  • As OCCUPATION_TYPE is a categorical variable and is of object type and since the missing value percentage is high (31.35%) we could NOT take its mode value to fill the missing ones because that will simply make the data biased. So, it would be safe to rather create a new type 'Unknown' to fill the missing values.For the rest of categorical columns, we will fill them with mode.
In [9]:
cat_cols = new_app_df.select_dtypes(include='object').columns

for col in cat_cols:
    if col in ['OCCUPATION_TYPE']:
        new_app_df[col].fillna('Unknown', inplace=True)
    else:
        new_app_df[col].fillna(new_app_df[col].mode()[0], inplace=True)
C:\Users\HP\AppData\Local\Temp\ipykernel_10208\2814705985.py:7: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  new_app_df[col].fillna(new_app_df[col].mode()[0], inplace=True)
C:\Users\HP\AppData\Local\Temp\ipykernel_10208\2814705985.py:5: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  new_app_df[col].fillna('Unknown', inplace=True)
  • For the rest of the numerical columns, we will impute them with median.
In [10]:
num_cols = new_app_df.select_dtypes(include=['int64', 'float64']).columns
for col in num_cols:
    new_app_df[col].fillna(new_app_df[col].median(), inplace=True)
C:\Users\HP\AppData\Local\Temp\ipykernel_10208\3182007245.py:3: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  new_app_df[col].fillna(new_app_df[col].median(), inplace=True)
In [11]:
#sanity check to see if there are any null columns left
null_col_check = new_app_df.columns.isna().sum()
null_col_check
Out[11]:
np.int64(0)

3.2 Data Correction¶

  • Firstly, we will convert the columns FLAG_OWN_CAR & FLAG_OWN_REALTY to int
In [12]:
# fix FLAG columns (Y/N → 1/0)
flag_text_cols = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
for col in flag_text_cols:
    new_app_df[col] = new_app_df[col].map({'Y': 1, 'N': 0}).astype('int64')
  • We will also change these columns as count is usually in int
    • CNT_FAM_MEMBERS
    • OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE
    • OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE
    • AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY
    • AMT_REQ_CREDIT_BUREAU_WEEK, AMT_REQ_CREDIT_BUREAU_MON
    • AMT_REQ_CREDIT_BUREAU_QRT, AMT_REQ_CREDIT_BUREAU_YEAR
In [13]:
count_cols = [
    'CNT_FAM_MEMBERS',
    'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
    'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
    'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
    'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
    'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'
]
for col in count_cols:
    new_app_df[col] = new_app_df[col].astype('int64')
  • We will also change these columns to int as days are usually in int
  • DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, DAYS_LAST_PHONE_CHANGE
In [14]:
days_cols = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']

for col in days_cols:
    new_app_df[col] = new_app_df[col].astype('int64')

3.3 Data Standardization¶

  • There are some columns that are negative in value but should be positive as a negative value wouldn't make sense in such columns. We need to convert them to positive using their absolute values.Those columns are listed below.
    • DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, DAYS_LAST_PHONE_CHANGE
In [15]:
pos_days_cols = ['DAYS_BIRTH','DAYS_EMPLOYED','DAYS_REGISTRATION','DAYS_ID_PUBLISH','DAYS_LAST_PHONE_CHANGE']
# concvert to absolute values
for col in pos_days_cols:
    new_app_df[col]= new_app_df[col].abs()
  • We should also convert columns DAYS_BIRTH & DAYS_EMPLOYED to AGE_YEARS & YEARS_EMPLOYED for better readability
In [16]:
new_app_df['AGE_YEARS']= (new_app_df['DAYS_BIRTH'] / 365.25).round().astype('int64')
new_app_df['AGE_YEARS']
Out[16]:
0         26
1         46
2         52
3         52
4         55
          ..
307506    26
307507    57
307508    41
307509    33
307510    46
Name: AGE_YEARS, Length: 307511, dtype: int64
In [17]:
new_app_df['YEARS_EMPLOYED'] = (new_app_df['DAYS_EMPLOYED'] / 365.25).round().astype('int64')
new_app_df['YEARS_EMPLOYED']
Out[17]:
0            2
1            3
2            1
3            8
4            8
          ... 
307506       1
307507    1000
307508      22
307509      13
307510       3
Name: YEARS_EMPLOYED, Length: 307511, dtype: int64
In [18]:
#drop DAYS_BIRTH to avoid redundancy
new_app_df.drop(columns=['DAYS_BIRTH'], inplace=True)
new_app_df.drop(columns=['DAYS_EMPLOYED'], inplace=True)
  • We will also convert 1000 (365243 in days) which is a placeholder -> nan in DAYS_EMPLOYED
  • This is because 365243 is used to represent specific flag to indicate applicant is unemployed
  • Since we had converted the DAYS_EMPLOYED to YEARS_EMPLOYED, we will now convert 1000 to nan
In [19]:
new_app_df['YEARS_EMPLOYED'].replace(1000, np.nan, inplace=True)
C:\Users\HP\AppData\Local\Temp\ipykernel_10208\209312331.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  new_app_df['YEARS_EMPLOYED'].replace(1000, np.nan, inplace=True)
  • We will also convert CODE_GENDER 'XNA' to nan as such gender doesn't exist
In [21]:
new_app_df['CODE_GENDER'].value_counts()
Out[21]:
CODE_GENDER
F      202448
M      105059
XNA         4
Name: count, dtype: int64
In [20]:
new_app_df['CODE_GENDER'] = new_app_df['CODE_GENDER'].replace('XNA',np.nan)
new_app_df['CODE_GENDER'].value_counts()
Out[20]:
CODE_GENDER
F    202448
M    105059
Name: count, dtype: int64

3.4 Outlier Analysis¶

  • Next we are going to detect and possibly remove outliers from our dataset. Outliers are values that are unusually high or low compared to the rest of the data. Below are the columns we will perform outlier analysis on
    • AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, AGE_YEARS, YEARS_EMPLOYED, CNT_CHILDREN, CNT_FAM_MEMBERS, EXT_SOURCE_2, EXT_SOURCE_3
    • Firstly we will perform IQR Rule followed by visualizations using boxlpots for each column.
In [21]:
outlier_cols = [
    'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'AGE_YEARS', 'YEARS_EMPLOYED', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS',
    'EXT_SOURCE_2', 'EXT_SOURCE_3'
]

def find_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    return len(outliers), lower, upper

# Summary
outlier_summary = []
for col in outlier_cols:
    n_out, low, up = find_outliers(new_app_df, col)
    pct = n_out / len(new_app_df) * 100
    outlier_summary.append({
        'Column': col,
        'Outliers': n_out,
        '%': round(pct, 2),
        'Lower': round(low, 2),
        'Upper': round(up, 2)
    })
    
summary_df = pd.DataFrame(outlier_summary)
print(summary_df)
             Column  Outliers     %      Lower       Upper
0  AMT_INCOME_TOTAL     14035  4.56  -22500.00   337500.00
1        AMT_CREDIT      6562  2.13 -537975.00  1616625.00
2       AMT_ANNUITY      7504  2.44  -10584.00    61704.00
3   AMT_GOODS_PRICE     14728  4.79 -423000.00  1341000.00
4         AGE_YEARS         0  0.00       4.00       84.00
5    YEARS_EMPLOYED     13457  4.38      -8.50       19.50
6      CNT_CHILDREN      4272  1.39      -1.50        2.50
7   CNT_FAM_MEMBERS      4007  1.30       0.50        4.50
8      EXT_SOURCE_2         0  0.00      -0.01        1.07
9      EXT_SOURCE_3      4313  1.40       0.09        0.97
In [39]:
fig, axes = plt.subplots(2, 5, figsize=(18, 8))
axes = axes.ravel()

for i, col in enumerate(outlier_cols):
    sns.boxplot(data=new_app_df, y=col, ax=axes[i], color='lightblue')
    axes[i].set_title(col)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
No description has been provided for this image
  • Next we will perform capping which is used to handle extreme outliers in the numerical columns.
  • In our dataset, we will perform capping by replacing all values that fall above the upper limit with the upper limit value itself.
  • We decided to cap the outliers instead of removing them to ensure data integrity & to take into account for extreme values.
  • We acknowlege that extreme values are real values that provide valuable insight in our analysis.
In [22]:
#capping
# List of columns to cap
cap_cols = [
    'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    'CNT_CHILDREN', 'CNT_FAM_MEMBERS','YEARS_EMPLOYED'
]

# Cap using IQR upper bound
for col in cap_cols:
    Q1 = new_app_df[col].quantile(0.25)
    Q3 = new_app_df[col].quantile(0.75)
    IQR = Q3 - Q1
    upper = Q3 + 1.5 * IQR
    new_app_df[col] = new_app_df[col].clip(upper=upper)
In [48]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(18, 8))
axes = axes.ravel()

for i, col in enumerate(outlier_cols):
    sns.boxplot(data=new_app_df, y=col, ax=axes[i], color='lightblue')
    axes[i].set_title(col)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
No description has been provided for this image
  • From the boxplot, it can be seen that no more outliers exist as they are capped at their upper quantile values

3.5 Feature Binning¶

  • We will also perform feature binning on some of the numerical columns to make sure that we can later used those binned columns for visualization of the default risk rate plot.
  • These are the columns we will perform binning on:
    • AMT_CREDIT
    • AMT_ANNUITY
    • AMT_INCOME_TOTAL
    • AMT_GOODS_PRICE
    • AGE_YEARS
    • YEARS_EMPLOYED
    • EXT_SOURCE_2
    • EXT_SOURCE_3

AMT_CREDIT¶

In [23]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
new_app_df['CREDIT_BIN_FREQ'] = pd.qcut(
    new_app_df['AMT_CREDIT'], 
    q=N_BINS, 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

new_app_df['CREDIT_BIN_FREQ'].value_counts().sort_index()
Out[23]:
CREDIT_BIN_FREQ
Q1_Low        78421
Q2_MidLow     75428
Q3_MidHigh    77786
Q4_High       75876
Name: count, dtype: int64

AMT_ANNUITY¶

In [24]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
new_app_df['ANNUITY_BIN_FREQ'] = pd.qcut(
    new_app_df['AMT_ANNUITY'], 
    q=N_BINS, 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

new_app_df['ANNUITY_BIN_FREQ'].value_counts().sort_index()
Out[24]:
ANNUITY_BIN_FREQ
Q1_Low        76893
Q2_MidLow     76892
Q3_MidHigh    76962
Q4_High       76764
Name: count, dtype: int64

AMT_INCOME_TOTAL¶

In [25]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
new_app_df['INCOME_BIN_FREQ'] = pd.qcut(
    new_app_df['AMT_INCOME_TOTAL'], 
    q=N_BINS, 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

new_app_df['INCOME_BIN_FREQ'].value_counts().sort_index()
Out[25]:
INCOME_BIN_FREQ
Q1_Low        100578
Q2_MidLow      53182
Q3_MidHigh     82213
Q4_High        71538
Name: count, dtype: int64

AMT_GOODS_PRICE¶

In [26]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
new_app_df['GOODS_PRICE_BIN_FREQ'] = pd.qcut(
    new_app_df['AMT_GOODS_PRICE'], 
    q=N_BINS, 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

new_app_df['GOODS_PRICE_BIN_FREQ'].value_counts().sort_index()
Out[26]:
GOODS_PRICE_BIN_FREQ
Q1_Low        79877
Q2_MidLow     78380
Q3_MidHigh    73720
Q4_High       75534
Name: count, dtype: int64

AGE¶

In [27]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4

# Create the new binned column for Age
new_app_df['AGE_BIN_FREQ'] = pd.qcut(
    new_app_df['AGE_YEARS'], 
    q=N_BINS, 
    labels=['Q1_Youngest', 'Q2_Young', 'Q3_MidAge', 'Q4_Oldest'], 
    duplicates='drop' # Handles identical values at cutoffs
)
new_app_df['AGE_BIN_FREQ'].value_counts().sort_index()
Out[27]:
AGE_BIN_FREQ
Q1_Youngest    80804
Q2_Young       76153
Q3_MidAge      78242
Q4_Oldest      72312
Name: count, dtype: int64

YEARS_EMPLOYED¶

In [28]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4

# Create the new binned column for Years Employed
new_app_df['EMPLOYMENT_BIN_FREQ'] = pd.qcut(
    new_app_df['YEARS_EMPLOYED'], 
    q=N_BINS, 
    labels=['Q1_Shortest', 'Q2_Short', 'Q3_Long', 'Q4_Longest'], 
    duplicates='drop' 
)
new_app_df['EMPLOYMENT_BIN_FREQ'].value_counts().sort_index()
Out[28]:
EMPLOYMENT_BIN_FREQ
Q1_Shortest    75163
Q2_Short       69813
Q3_Long        51804
Q4_Longest     55357
Name: count, dtype: int64
  • We will first convert the EMPLOYMENT_BIN_FREQ into an object
In [29]:
# 1. FIX: Convert the column from 'category' type to generic 'object' (string) type.
# This removes the fixed category list constraint, allowing the new string to be added.
new_app_df['EMPLOYMENT_BIN_FREQ'] = new_app_df['EMPLOYMENT_BIN_FREQ'].astype(object)

print(new_app_df['EMPLOYMENT_BIN_FREQ'].isna().sum())
55374
  • We will then add a new category which will contain the 'No Employment' values
In [30]:
new_app_df['EMPLOYMENT_BIN_FREQ'] = new_app_df['EMPLOYMENT_BIN_FREQ'].fillna('No Employment')
In [31]:
new_app_df['EMPLOYMENT_BIN_FREQ'].isna().sum()
Out[31]:
np.int64(0)
  • We will use np.where since the round method gave incorrect results.
  • Using np.where, we can make sure that values that are 2.5 are converted to 3 accurately.

CNT_CHILDREN¶

In [32]:
new_app_df['CNT_CHILDREN'] = np.where(
    (new_app_df['CNT_CHILDREN'] > 2.49) & (new_app_df['CNT_CHILDREN'] < 2.51),
    3.0,
    new_app_df['CNT_CHILDREN']
)

# Enforce integer type now that the fix is applied
new_app_df['CNT_CHILDREN'] = new_app_df['CNT_CHILDREN'].astype(int)

print("--- Data Integrity Check (After Fixing 2.5 -> 3) ---")
print("Unique Counts in the MODIFIED original column:")
print(new_app_df['CNT_CHILDREN'].value_counts().sort_index())

##--- B. CREATE STRING TIER (FOR REPORTING/EDA) ---
##This converts the placeholder 3 into the readable string '3+ Children'.

CAP_VALUE = 3

new_app_df['CNT_CHILDREN_OBJ'] = np.where(
    new_app_df['CNT_CHILDREN'] == CAP_VALUE,
    f'{CAP_VALUE}+', # Labels the placeholder 3 as the string '3+ Children'
    new_app_df['CNT_CHILDREN'].astype(str) # Converts 0, 1, 2 to '0', '1', '2'
)
print(new_app_df['CNT_CHILDREN_OBJ'].value_counts().sort_index())
--- Data Integrity Check (After Fixing 2.5 -> 3) ---
Unique Counts in the MODIFIED original column:
CNT_CHILDREN
0    215371
1     61119
2     26749
3      4272
Name: count, dtype: int64
CNT_CHILDREN_OBJ
0     215371
1      61119
2      26749
3+      4272
Name: count, dtype: int64

CNT_FAM_MEMBERS¶

In [33]:
# 1. FIX CORRUPTION (4.5 -> 5.0)
# We assume a similar corruption exists for family members (e.g., 4.5) and fix it.
new_app_df['CNT_FAM_MEMBERS'] = np.where(
    (new_app_df['CNT_FAM_MEMBERS'] > 4.49) & (new_app_df['CNT_FAM_MEMBERS'] < 4.51),
    5.0, # Fix 4.5 to 5.0
    new_app_df['CNT_FAM_MEMBERS']
)
new_app_df['CNT_FAM_MEMBERS'] = new_app_df['CNT_FAM_MEMBERS'].astype(int)

CAP_MEMBERS = 5

# 2. CREATE FINAL STRING TIER (Capping Logic)
# We cap all counts >= 5 into the '5+ Members' group.
new_app_df['CNT_FAM_MEMBERS_OBJ'] = np.where(
    new_app_df['CNT_FAM_MEMBERS'] >= CAP_MEMBERS,
    f'{CAP_MEMBERS}+', # Labels 5, 6, 7, etc. as '5+ Members'
    new_app_df['CNT_FAM_MEMBERS'].astype(str) # Converts 1, 2, 3, 4 to strings
)

new_app_df['CNT_FAM_MEMBERS_OBJ'].value_counts().sort_index()
Out[33]:
CNT_FAM_MEMBERS_OBJ
1      67847
2     158359
3      52601
4      24697
5+      4007
Name: count, dtype: int64

EXT_SOURCE_2¶

In [34]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4

# Create the new binned column for Age
new_app_df['EXT_SOURCE_2_BIN_FREQ'] = pd.qcut(
    new_app_df['EXT_SOURCE_2'], 
    q=N_BINS, 
    # Optional: name your categories 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles identical values at cutoffs
)
new_app_df['EXT_SOURCE_2_BIN_FREQ'].value_counts().sort_index()
Out[34]:
EXT_SOURCE_2_BIN_FREQ
Q1_Low        76879
Q2_MidLow     77207
Q3_MidHigh    76548
Q4_High       76877
Name: count, dtype: int64

EXT_SOURCE_3¶

In [35]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4

# Create the new binned column for Age
new_app_df['EXT_SOURCE_3_BIN_FREQ'] = pd.qcut(
    new_app_df['EXT_SOURCE_3'], 
    q=N_BINS, 
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles identical values at cutoffs
)
new_app_df['EXT_SOURCE_3_BIN_FREQ'].value_counts().sort_index()
Out[35]:
EXT_SOURCE_3_BIN_FREQ
Q1_Low         77497
Q2_MidLow     106845
Q3_MidHigh     47086
Q4_High        76083
Name: count, dtype: int64

4. Data Analysis¶

  • Next we will perform segmented univariate analysis using segmented univariate, bivariate & multivariate analysis.

4.1 Segmented Univariate Analysis¶

Univariate Categorical Analysis¶

  • We will perform this analysis on the categorial columns below:
    • NAME_CONTRACT_TYPE, CODE_GENDER, OCCUPATION_TYPE, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, AGE_GROUP,
    • In order to determine the % of the group that is most likely to default, we will use the default risk plot which utilises the formula below: Default Rate (Risk) = Count of Target (1) / (Count of Target (1) + Count of Target (0))
In [36]:
#function to plot univariate categorical variables
def cat_plot(df, col_x):
    plt.figure(figsize=(10,5))
    # create the count plot
    ax = sns.countplot(data=df, x=col_x, hue='TARGET')

    # create the title
    plt.title(f"Target Distribution by {col_x}")
    plt.xticks(rotation=45, ha='right')
    #show the plot
    plt.show()
In [37]:
# 1. Calculate the Default Risk Rate (Mean of TARGET) for each category
# The mean of a binary column (0 or 1) is the proportion of 1s (defaulters)

def univariate_decision_support_risk_plot(df, col_x):
   default_rate_pct = df.groupby(col_x)['TARGET'].mean()

   # 2. Convert to percentage and round to 2 decimal places
   default_rate_pct = (default_rate_pct * 100).round(2)

   # 3. Sort the categories by risk rate (highest risk first)
   default_rate_pct = default_rate_pct.sort_values(ascending=False)

   # 4. Prepare data for plotting (reset index to make columns)
   plot_data = default_rate_pct.reset_index()
   plot_data.columns = [col_x, 'Default Risk Rate (%)']

   # Plot using pointplot (as requested)
   plt.figure(figsize=(10, 5))
   sns.barplot(x=col_x, y='Default Risk Rate (%)', data=plot_data)
   plt.xticks(rotation=45, ha='right')
   plt.xlabel(col_x)
   plt.ylabel("Default Risk Rate (%)")
   plt.title(f"Default Risk Rate by {col_x}")
   plt.show()

NAME_CONTRACT_TYPE¶

In [58]:
cat_plot(new_app_df,'NAME_CONTRACT_TYPE')
No description has been provided for this image
  • The number of deefaulters & those paying on time is higher for cash loans as compared to revolving loans.
In [59]:
univariate_decision_support_risk_plot(new_app_df,'NAME_CONTRACT_TYPE')
No description has been provided for this image
  • Revolving loans appear to have a lower default risk rate as compared to cash loans.

CODE_GENDER¶

In [163]:
cat_plot(new_app_df,'CODE_GENDER')
No description has been provided for this image
  • The number of defaulters & those paying on time is higher for females as compared to males.
In [164]:
univariate_decision_support_risk_plot(new_app_df,'CODE_GENDER')
No description has been provided for this image
  • Males appear to have a higher default risk rate as compared to females.

OCCUPATION_TYPE¶

In [165]:
cat_plot(new_app_df,'OCCUPATION_TYPE')
No description has been provided for this image
  • The number of defaulters and those paying on time is the highest for unknown.
In [166]:
univariate_decision_support_risk_plot(new_app_df,'OCCUPATION_TYPE')
No description has been provided for this image
  • Low skill Laborers appear to have the highest default risk rate.
  • Accountants have the lowest default risk rate.

NAME_INCOME_TYPE¶

In [167]:
cat_plot(new_app_df,'NAME_INCOME_TYPE')
No description has been provided for this image
  • Working people have the highest defaulters and those paying on time.
In [168]:
univariate_decision_support_risk_plot(new_app_df,'NAME_INCOME_TYPE')
No description has been provided for this image
  • People with maternity leave appears to have highest default risk rate followed closely by unemployed.
  • Businessman and student seem to have a non-existant default risk rate.
  • Pensioners have the lowest default risk rate.

NAME_EDUCATION_TYPE¶

In [169]:
cat_plot(new_app_df,'NAME_EDUCATION_TYPE')
No description has been provided for this image
  • The number of defaulters and those paying on time is the highest for secondary/secondary special.
In [170]:
univariate_decision_support_risk_plot(new_app_df,'NAME_EDUCATION_TYPE')
No description has been provided for this image
  • People with lower secondary appear to have the highest default risk rate.
  • People with academic degree appear to have the lowest default risk rate.

NAME_FAMILY_STATUS¶

In [171]:
cat_plot(new_app_df,'NAME_FAMILY_STATUS')
No description has been provided for this image
  • The number of defaulters and those paying on time is the highest for married.
In [172]:
univariate_decision_support_risk_plot(new_app_df,'NAME_FAMILY_STATUS')
No description has been provided for this image
  • People in civil marriage appear to have the highest default risk rate followed very closely by single/not married.
  • The default risk rate for unknown is non-existant.

NAME_HOUSING_TYPE¶

In [173]:
cat_plot(new_app_df,'NAME_HOUSING_TYPE')
No description has been provided for this image
  • Those staying in house/apartment has the highest number of defaulters and those paying on time.
In [174]:
univariate_decision_support_risk_plot(new_app_df,'NAME_HOUSING_TYPE')
No description has been provided for this image
  • People with rented apartment have the highest default risk rate.
  • People with office apartment has the lowest default risk rate.

CNT_CHILDREN¶

In [175]:
cat_plot(new_app_df,'CNT_CHILDREN_OBJ')
No description has been provided for this image
  • As the number of children increases, the number of defaulters and repayers on time decreases.
In [176]:
univariate_decision_support_risk_plot(new_app_df,'CNT_CHILDREN_OBJ')
No description has been provided for this image
  • The default risk rate is the highest for 3+ children.
  • The default risk rate is the lowest for 0 children.

CNT_FAM_MEMBERS¶

In [177]:
cat_plot(new_app_df,'CNT_FAM_MEMBERS_OBJ')
No description has been provided for this image
  • The number of defaulters & payment on time is the highest for 2 family members.
In [178]:
univariate_decision_support_risk_plot(new_app_df,'CNT_FAM_MEMBERS_OBJ')
No description has been provided for this image
  • The default risk rate is the highest for 5+ members.
  • The default risk rate is the lowest for 2 members.

FLAG_OWN_CAR¶

In [179]:
cat_plot(new_app_df,'FLAG_OWN_CAR')
No description has been provided for this image
  • The number of defaulters and payment on time is the highest for those not owning a car.
In [180]:
univariate_decision_support_risk_plot(new_app_df,'FLAG_OWN_CAR')
No description has been provided for this image
  • The default risk rate is higher for those not owning a car.

FLAG_OWN_REALTY¶

In [181]:
cat_plot(new_app_df,'FLAG_OWN_REALTY')
No description has been provided for this image
  • The number of defaulters and payment on time is higher for those owning an asset.
In [182]:
univariate_decision_support_risk_plot(new_app_df,'FLAG_OWN_REALTY')
No description has been provided for this image
  • The default risk rate is higher for those not owning an asset.

REGION_RATING_CLIENT¶

In [183]:
cat_plot(new_app_df,'REGION_RATING_CLIENT')
No description has been provided for this image
  • Region rated as 2 has the highest defaulters and payment on time.
In [184]:
univariate_decision_support_risk_plot(new_app_df,'REGION_RATING_CLIENT')
No description has been provided for this image
  • The default risk rate is highest for the region rated as 3.
  • The default risk rate is the lowest for the region rated as 1.

Univariate Numerical Analysis¶

  • We will perform this analysis on the numerical columns below:
    • CNT_CHILDREN, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, CNT_FAM_MEMBERS, AGE_YEARS, YEARS_EMPLOYED, FLAG_OWN_CAR, FLAG_OWN_REALTY
    • The higher the blue peak relative to its red peak, the higher the risk density & vice versa.
    • Once we find the risk density, we will plot the default risk rate graph.
In [38]:
df0 = new_app_df[new_app_df['TARGET'] == 0]
df1 = new_app_df[new_app_df['TARGET'] == 1]
In [39]:
# function to plot numerical variables
def cont_plot(df, col_x):
    plt.figure(figsize=(10,6))
    
    # 1. Use plt.gca() to get the current Axes object for the first plot
    #    and assign it to 'ax'.
    ax = sns.kdeplot(
        df0[col_x], 
        label="Target 0 (No Default)", 
        color="Red",
        fill=True
    )
    
    # 2. Plot the second distribution, referencing the same axes object 'ax'
    #    to ensure they are overlaid correctly.
    ax = sns.kdeplot(
        df1[col_x], 
        label="Target 1 (Default)", 
        color="Blue",
        ax=ax, 
        fill=True
    )
    
    # 3. Create the title and labels
    plt.title(f"Target Distribution by {col_x}")
    plt.xlabel(col_x)
    plt.ylabel("Density")
    plt.legend()
    
    # 4. Show the plot
    plt.show()

AMT_INCOME_TOTAL¶

In [102]:
cont_plot(new_app_df,'AMT_INCOME_TOTAL')
No description has been provided for this image
  • People with income range around 50000 has the highest risk density.
In [104]:
univariate_decision_support_risk_plot(new_app_df,'INCOME_BIN_FREQ')
No description has been provided for this image
  • The default risk rate is the highest for Q2_MidLow.
  • The default risk rate is the lowest for Q4_High.

AMT_CREDIT¶

In [105]:
cont_plot(new_app_df,'AMT_CREDIT')
No description has been provided for this image
  • People with AMT_CREDIT around 0.5 has the highest risk density.
  • People with AMT_CREDIT around 1.6 has the lowest risk density.
In [106]:
univariate_decision_support_risk_plot(new_app_df,'CREDIT_BIN_FREQ')
No description has been provided for this image
  • The highest default risk rate is the Q2_MidLow.
  • The lowest default risk rate is th Q4_High.

AMT_ANNUITY¶

In [107]:
cont_plot(new_app_df,'AMT_ANNUITY')
No description has been provided for this image
  • People with AMT_ANNUITY between 18000 to 40000 have the highest risk density.
  • People with AMT_ANNUITY >60000 have the lowest risk density.
In [269]:
univariate_decision_support_risk_plot(new_app_df,'ANNUITY_BIN_FREQ')
No description has been provided for this image
  • Q3_MidHigh has the highest default risk rate.
  • Q1_Low has the lowest default risk rate.

AMT_GOODS_PRICE¶

In [270]:
cont_plot(new_app_df,'AMT_GOODS_PRICE')
No description has been provided for this image
  • People with AMT_GOODS_PRICE between 0.4 to 0.6 have the highest risk density.
In [189]:
univariate_decision_support_risk_plot(new_app_df,'GOODS_PRICE_BIN_FREQ')
No description has been provided for this image
  • Q2_MidLow has the highest default risk rate.
  • Q4_High has the lowest default risk rate.

AGE_YEARS¶

In [190]:
cont_plot(new_app_df,'AGE_YEARS')
No description has been provided for this image
  • People with age between 30 to 40 have the highest risk density.
In [191]:
univariate_decision_support_risk_plot(new_app_df,'AGE_BIN_FREQ')
No description has been provided for this image
  • Q1_Youngest has the highest default risk rate.
  • Q4_Oldest has the lowest default risk rate.

YEARS_EMPLOYED¶

In [192]:
cont_plot(new_app_df,'YEARS_EMPLOYED')
No description has been provided for this image
  • People employed between 0 to 5 years have the highest risk density.
In [193]:
univariate_decision_support_risk_plot(new_app_df,'EMPLOYMENT_BIN_FREQ')
No description has been provided for this image
  • Q1_Shortest has the highest default risk rate.
  • Q4_Longest has the lowest default risk rate.

EXT_SOURCE_2¶

In [194]:
cont_plot(new_app_df,'EXT_SOURCE_2')
No description has been provided for this image
  • The risk density is highest between 0.0 to 0.5.
In [195]:
univariate_decision_support_risk_plot(new_app_df,'EXT_SOURCE_2_BIN_FREQ')
No description has been provided for this image
  • The default risk rate is the highest for Q1_Low.
  • The default risk rate is the lowest for Q4_High.

EXT_SOURCE_3¶

In [196]:
cont_plot(new_app_df,'EXT_SOURCE_3')
No description has been provided for this image
  • The risk density is the highest between 0.0 to 0.5.
In [197]:
univariate_decision_support_risk_plot(new_app_df,'EXT_SOURCE_3_BIN_FREQ')
No description has been provided for this image
  • The default risk rate is the highest for Q1_Low.
  • The default risk rate is the lowest for Q4_High.

Univariate Numerical Discrete Analysis¶

4.2 Bivariate Analysis¶

  • We will perform 3 types of bivariate analysis.
    • Categorical-categorical columns
    • Categorical-continuous columns
    • Continuos-continuos columns

Categorical-categorical columns¶

  • These are the pairings used
    • NAME_EDUCATION_TYPE - NAME_INCOME_TYPE
    • NAME_FAMILY_STATUS - CODE_GENDER
    • NAME_HOUSING_TYPE - NAME_CONTRACT_TYPE
    • NAME_INCOME_TYPE - OCCUPATION_TYPE
    • FLAG_OWN_CAR - CNT_CHILDREN
In [40]:
df0 = new_app_df[new_app_df['TARGET']==0]
df1 = new_app_df[new_app_df['TARGET']==1]
In [41]:
# function to plot categorical,categorical 
def cat_cat_plot(df,col_x,col_hue):

    num_hues = df[col_hue].nunique(dropna=False) 

    # 2. Generate the dynamic, high-contrast palette
    if num_hues <= 10:
        # Use 'Set1' for up to 9 categories for maximum contrast
        dynamic_palette = 'Set1' 
    elif num_hues <= 20:
        # Use 'tab20' for up to 20 categories
        dynamic_palette = sns.color_palette("tab20", n_colors=num_hues)
    elif num_hues > 20:
        # If there are more than 20, we use a different qualitative map like 'hls'
        # WARNING: If num_hues > 20, consider re-binning var2 as visualization will be cluttered.
        dynamic_palette = sns.color_palette("hls", n_colors=num_hues)
    
    
    plt.figure(figsize=(15,8))
    # --- Subplot 1: DEFAULTERS (df1) ---
    plt.subplot(1, 2, 1)
    # Pass the column names as strings (var1, var2) to sns.countplot
    ax1 = sns.countplot(data=df1, x=col_x, hue=col_hue, palette=dynamic_palette)

    # Pass the column names as strings for the title and labels
    plt.title(f"Distribution of {col_x} by {col_hue} for DEFAULTERS")
    plt.xlabel(col_x)
    plt.ylabel("Count")
    plt.xticks(rotation=45, ha='right')
    plt.legend(title=col_hue) # Added legend

    # --- Subplot 2: NON-DEFAULTERS (df0) ---
    plt.subplot(1, 2, 2)
    ax2 = sns.countplot(data=df0, x=col_x, hue=col_hue, palette=dynamic_palette)

    plt.title(f"Distribution of {col_x} by {col_hue} for Others")
    plt.xlabel(col_x)
    # Set ylabel to blank or remove it, as it's the same as the first plot's Y-axis
    plt.ylabel("Count") 
    plt.xticks(rotation=45, ha='right')
    plt.legend(title=col_hue) # Added legend
    
    # Crucial for preventing subplot titles and labels from overlapping
    plt.tight_layout()
    plt.show()
In [42]:
def bivariate_decision_support_risk_plot(df, col_x, col_hue):
    # 1. Group by both columns and calculate the mean of TARGET (the Default Rate)
    default_rate_pct = df.groupby([col_x, col_hue])['TARGET'].mean()
    default_rate_pct = (default_rate_pct * 100).round(2)

    plot_data = default_rate_pct.reset_index()
    plot_data.columns = [col_x, col_hue, 'Default Risk Rate (%)']

    num_hues = df[col_hue].nunique(dropna=False) 

    # 2. Generate the dynamic, high-contrast palette
    if num_hues <= 10:
        # Use 'Set1' for up to 9 categories for maximum contrast
        dynamic_palette = 'Set1' 
    elif num_hues <= 20:
        # Use 'tab20' for up to 20 categories
        dynamic_palette = sns.color_palette("tab20", n_colors=num_hues)
    elif num_hues > 20:
        # If there are more than 20, we use a different qualitative map like 'hls'
        # WARNING: If num_hues > 20, consider re-binning var2 as visualization will be cluttered.
        dynamic_palette = sns.color_palette("hls", n_colors=num_hues)
    
    
    plt.figure(figsize=(15,8))
    
    # 3. Plot the data using the default column names:
    ax = sns.barplot(
        data=plot_data, 
        x=col_x,
        y='Default Risk Rate (%)',
        hue=col_hue,
        palette=dynamic_palette
    )
    
    plt.title(f"Default Risk Rate (%) for {col_x} by {col_hue}")
    plt.xlabel(col_x)
    plt.ylabel("Default Rate Percentage")
    # You must manually rename the legend title if you don't use .columns
    plt.legend(title=col_hue) 
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

NAME_EDUCATION_TYPE - NAME_INCOME_TYPE¶

In [111]:
cat_cat_plot(new_app_df,'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE')
No description has been provided for this image
  • People with secondary/secondary special has the highest number of defaulters and payment on time.
In [199]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_EDUCATION_TYPE', 'NAME_INCOME_TYPE')
No description has been provided for this image
  • Maternity leave income type within secondary/secondary special have the highest default risk rate.
  • People with academic degree have a very low default risk rate.

NAME_FAMILY_STATUS - CODE_GENDER¶

In [198]:
cat_cat_plot(new_app_df,'NAME_FAMILY_STATUS','CODE_GENDER')
No description has been provided for this image
  • In all categories other than Single/not married, there are more females than males who have defaulted.
  • Females have made payment on time for all categories.
In [200]:
bivariate_decision_support_risk_plot(new_app_df, 'NAME_FAMILY_STATUS', 'CODE_GENDER')
No description has been provided for this image
  • Males have a higher default risk rate across all family status.
  • Among all females within each family status group, widow has the lowest default risk rate.

NAME_HOUSING_TYPE - NAME_CONTRACT_STATUS¶

In [201]:
cat_cat_plot(new_app_df,'NAME_HOUSING_TYPE','NAME_CONTRACT_TYPE')
No description has been provided for this image
  • Cash loans have higher number of defaulters & payment on time than revolving loans.
  • Cash loans for house/apartment has the highest number of defaulters & payment on time.
In [202]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_HOUSING_TYPE', 'NAME_CONTRACT_TYPE')
No description has been provided for this image
  • Cash loans have a higher default risk rate compared to revolving loans for all housing types.
  • Cash loans for rented apartment has the highest default risk rate followed closely by cash loans with parents.
  • Revolving loans for municipal apartment has the lowest default risk rate.

NAME_INCOME_TYPE - OCCUPATION_TYPE¶

In [203]:
cat_cat_plot(new_app_df,'NAME_INCOME_TYPE','OCCUPATION_TYPE')
No description has been provided for this image
  • Working laborers have the highest number of defaulters.
  • Pensioners have the highest number of people paying on time.
In [204]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_INCOME_TYPE', 'OCCUPATION_TYPE')
No description has been provided for this image
  • Maternity leave laborers has the highest default risk rate followed by maternity leave core staff.
  • Pensioners have among the lowest default rates, highest likelihood of paying on time.
  • The default rate for businessman & student is non-existant.

FLAG_OWN_CAR - CNT_CHILDREN¶

In [205]:
cat_cat_plot(new_app_df, 'FLAG_OWN_CAR','CNT_CHILDREN_OBJ')
No description has been provided for this image
  • The number of defaulters and payment on time is the highest for those with 0 children for both graphs.
  • Those with 3 children has the lowest defaulters for both car categories.
In [206]:
bivariate_decision_support_risk_plot(new_app_df,'FLAG_OWN_CAR', 'CNT_CHILDREN_OBJ')
No description has been provided for this image
  • The default risk rate is the the highest for those with 3+ children regardless of whether the person owns a car or not.
  • The default risk rate is the lowest for those with 0 children regardless of whether the person owns a car or not.

FLAG_OWN_REALTY - CNT_FAM_MEMBERS¶

In [207]:
cat_cat_plot(new_app_df, 'FLAG_OWN_REALTY','CNT_FAM_MEMBERS_OBJ')
No description has been provided for this image
  • The number of defaulters and payment on time is the highest for 2 family members regardless if the person owns an asset or not.
In [208]:
bivariate_decision_support_risk_plot(new_app_df,'FLAG_OWN_REALTY','CNT_FAM_MEMBERS_OBJ')
No description has been provided for this image
  • The default risk rate is the highest for 5+ family members regardless of whether the person own an asset or not.
  • The default risk rate is the lowest for 2 family members regardless of whether the person own an asset or not.

Categorical-continuos columns¶

  • NAME_INCOME_TYPE - AMT_CREDIT
  • NAME_EDUCATION_TYPE - AMT_ANNUITY
  • OCCUPATION_TYPE - AMT_INCOME_TOTAL
  • NAME_HOUSING_TYPE - AMT_GOODS_PRICE
  • FLAG_OWN_REALTY - EXT_SOURCE_2
In [43]:
# function to plot categorical,contuinious
def cat_cont_plot(df, var1, var2):
    
    plt.figure(figsize=(15,8))

    # --- Subplot 1: DEFAULTERS (df1) ---
    plt.subplot(1, 2, 1)
    # Pass the column names as strings (var1, var2) to sns.countplot
    ax1 = sns.boxplot(data=df1, x=var1, y=var2, showfliers=False)
    # Pass the column names as strings for the title and labels
    plt.title(f"Distribution of {var1} by {var2} for Defaulters")
    plt.xlabel(var1)
    plt.xticks(rotation=45, ha='right')
    plt.ylabel(var2)
    
    # --- Subplot 2: NON-DEFAULTERS (df0) ---
    plt.subplot(1, 2, 2)
    ax2 = sns.boxplot(data=df0, x=var1, y=var2, showfliers=False)
    plt.title(f"Distribution of {var1} by {var2} for Others")
    plt.xlabel(var1)
    plt.xticks(rotation=45, ha='right')
    plt.ylabel(var2) 

    # Crucial for preventing subplot titles and labels from overlapping
    plt.tight_layout()
    plt.show()

NAME_INCOME_TYPE - AMT_CREDIT¶

In [211]:
cat_cont_plot(new_app_df,'NAME_INCOME_TYPE','AMT_CREDIT')
No description has been provided for this image
  • The median of maternity leave and unemployed is higher for defaulters as compared to others.
  • The median of businessman is the highest among all repayment groups
In [209]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_INCOME_TYPE','CREDIT_BIN_FREQ')
No description has been provided for this image
  • Unemployed income type within the Q2_MidLow has the highest default risk rate.
  • The default rate for businessman is non-existant.
  • Pensioners have the lowest default rate almost all income categories.

NAME_EDUCATION_TYPE - AMT_ANNUITY¶

In [51]:
cat_cont_plot(new_app_df,'NAME_EDUCATION_TYPE','AMT_ANNUITY')
No description has been provided for this image
  • The median is somewhat similar across almost all education types.
  • The median of academic degree is higher for defaulters as compared to others.
In [213]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_EDUCATION_TYPE','ANNUITY_BIN_FREQ')
No description has been provided for this image
  • Lower secondary with Q3_MidHigh seems to have the highest default risk rate.
  • Academic degree seems to have lowest default risk rate for the columns visible while some of the columns are non-existant.
  • Lower secondary also has the highest default risk rate across all the annuity categories.

OCCUPATION_TYPE - AMT_INCOME_TOTAL¶

In [52]:
cat_cont_plot(new_app_df,'OCCUPATION_TYPE','AMT_INCOME_TOTAL')
No description has been provided for this image
  • The median is somewhat similar between most of the occupation types.
  • All boxes (IQR) overlap each other.
  • The median of IT Staff for defaulters is lower as compared to in others.
  • The median of HR staff is higher for defaulters as compared to others.
In [215]:
bivariate_decision_support_risk_plot(new_app_df,'OCCUPATION_TYPE','INCOME_BIN_FREQ')
No description has been provided for this image
  • Low-skill laborers within Q1_Low has the highest default risk rate.
  • Low-skill laborers also have the highest default risk rate for each categories.
  • Accoutants have the lowest default risk rate for each of the categories.

HOUSING_TYPE - AMT_GOODS_PRICE¶

In [53]:
cat_cont_plot(new_app_df,'NAME_HOUSING_TYPE','AMT_GOODS_PRICE')
No description has been provided for this image
  • The median of most housing types is somewhat similar.
  • The boxes (IQR) overlap each other.
  • The median of Co-op apartment in defaulters is slightly lower as compared to in others.
In [217]:
bivariate_decision_support_risk_plot(new_app_df,'NAME_HOUSING_TYPE','GOODS_PRICE_BIN_FREQ')
No description has been provided for this image
  • Rented apartment within Q2_MidLow has the highest default risk rate followed very closely by staying with parents within Q2_MidLow.
  • Rented apartment has the highest default risk rate across all goods price categories.
  • Office apartment has the lowest default risk rate across almost all goods price categories.

FLAG_OWN_REALTY - EXT_SOURCE_2¶

In [218]:
cat_cont_plot(new_app_df,'FLAG_OWN_REALTY','EXT_SOURCE_2')
No description has been provided for this image
  • The median of FLAG_OWN_REALTY in defaulters is lower as compared to the median in others.
In [219]:
bivariate_decision_support_risk_plot(new_app_df,'FLAG_OWN_REALTY','EXT_SOURCE_2_BIN_FREQ')
No description has been provided for this image
  • The default risk rate is the highest for Q1_Low regardless of whether the person owns an asset or does not.
  • The default risk rate is the lowest for Q4_High in both cases.

CNT_FAM_MEMBERS - EXT_SOURCE_3¶

In [55]:
cat_cont_plot(new_app_df,'CNT_FAM_MEMBERS_OBJ','EXT_SOURCE_3')
No description has been provided for this image
  • The median for cnt_fam_members is lower for defautlers compared to others.
In [44]:
bivariate_decision_support_risk_plot(new_app_df,'CNT_FAM_MEMBERS_OBJ','EXT_SOURCE_3_BIN_FREQ')
C:\Users\HP\AppData\Local\Temp\ipykernel_10208\3845574278.py:3: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  default_rate_pct = df.groupby([col_x, col_hue])['TARGET'].mean()
No description has been provided for this image
  • 5+ members within Q1_Low has the highest default risk rate.
  • Those within Q4_High has the lowest default risk rate regardless of the number of family members.

Multivariate Analysis¶

  • We will perform 2 types of multivariate analysis:
    • Categorical-categorical-categorical columns
    • Categorical-categorical-continous columns

Categorical-categorical-categorical¶

In [45]:
def cat_cat_cat_plot(df, col_x, col_hue, col_facet, target_col='TARGET'):   
    # The sns.catplot command handles the segmentation by Target and the third variable (col_facet) simultaneously.

    num_hues = df[col_hue].nunique(dropna=False) 

    # 2. Generate the dynamic, high-contrast palette based on hue column
    if num_hues <= 10:
        dynamic_palette = 'Set1' 
    elif num_hues <= 20:
        dynamic_palette = sns.color_palette("tab20", n_colors=num_hues)
    else:
        # Use 'hls' for a larger number, but note visualization will be cluttered
        dynamic_palette = sns.color_palette("hls", n_colors=num_hues)
    
    # 2. Use Seaborn's catplot for efficient faceting and grouping
    g = sns.catplot(
        data=df, # Use the entire DataFrame (no need for df0/df1 split)
        x=col_x,
        hue=col_hue,
        col=target_col,  # Separates Defaulters (1) vs Non-Defaulters (0) side-by-side
        row=col_facet,  # CREATES A NEW ROW OF PLOTS FOR EACH UNIQUE VALUE OF THE THIRD VARIABLE
        kind="count",
        aspect=2.0,  # Stretches the plot horizontally
        height=4,
        palette=dynamic_palette,
        sharey=True,  # All plots share the same Y-axis scale (Count)
        legend_out=True
    )

    # 3. Clean up the plot appearance
    g.set_titles(row_template='{row_name} Group', col_template='Target = {col_name}')
    g.set_axis_labels(col_x, "Count of Applications")
     
     # FIX: Explicitly iterate over the axes to set x-tick labels using the underlying text. 
    for ax in g.axes.flat:
        # CRITICAL FIX: Force Matplotlib to draw the X-axis labels on this specific axis.
        ax.tick_params(labelbottom=True) 
        
        # FIX for AttributeError: 'list' object has no attribute 'size'. Use len() instead of .size
        if len(ax.get_xticks()) > 0:
            # Get the existing labels (the category names)
            labels = [t.get_text() for t in ax.get_xticklabels()]
            # Re-set them with the rotation and alignment
            ax.set_xticklabels(labels, rotation=45, ha='right')
            
     # g.add_legend(title=col_hue, bbox_to_anchor=(1.02, 0.5), loc='center left')

    plt.suptitle(
        f"Multivariate Count Analysis by {col_x}, {col_hue}, and {col_facet}", 
        y=1.02, # Adjust title position
        fontsize=16
    )
    
    # FIX: Increase bottom margin to ensure rotated X-axis labels are not cut off
    g.fig.subplots_adjust(bottom=0.25, right=0.85, hspace=0.7) 
    plt.show()
In [46]:
def multivariate_decision_support_risk_plot(df, col_x, col_hue, col_facet, target_col='TARGET'):
    # 1. Group by all three columns and calculate the mean of TARGET (the Default Rate)
    default_rate_pct = df.groupby([col_x, col_hue, col_facet])[target_col].mean()
    # Convert mean (e.g., 0.08) to percentage (e.g., 8.0) and reset index for plotting
    plot_data = (default_rate_pct * 100).round(2).reset_index()
    plot_data.columns = [col_x, col_hue, col_facet, 'Default Risk Rate (%)']

    num_hues = df[col_hue].nunique(dropna=False) 

    # 2. Generate the dynamic, high-contrast palette based on hue column
    if num_hues <= 10:
        dynamic_palette = 'Set1' 
    elif num_hues <= 20:
        dynamic_palette = sns.color_palette("tab20", n_colors=num_hues)
    else:
        # Use 'hls' for a larger number, but note visualization will be cluttered
        dynamic_palette = sns.color_palette("hls", n_colors=num_hues)
    
    # 3. Use catplot to create a faceted bar plot
    g = sns.catplot(
        data=plot_data,
        x=col_x,
        y='Default Risk Rate (%)', # Plot the pre-calculated percentage
        hue=col_hue,
        col=col_facet,  # The third variable creates the facets (sub-plots)
        kind="bar",
        col_wrap=2,
        aspect=1.5,
        height=6,
        palette=dynamic_palette,
        legend_out=True
    )

    # 4. Clean up the plot appearance
    g.set_axis_labels(col_x, "Default Risk Rate Percentage")
    g.set_titles(col_template='{col_name} Group')
    
     # FIX: Explicitly iterate over the axes to set x-tick labels using the underlying text. 
    for ax in g.axes.flat:
        # CRITICAL FIX: Force Matplotlib to draw the X-axis labels on this specific axis.
        ax.tick_params(labelbottom=True) 
        
        # FIX for AttributeError: 'list' object has no attribute 'size'. Use len() instead of .size
        if len(ax.get_xticks()) > 0:
            # Get the existing labels (the category names)
            labels = [t.get_text() for t in ax.get_xticklabels()]
            # Re-set them with the rotation and alignment
            ax.set_xticklabels(labels, rotation=45, ha='right')
            
    # g.add_legend(title=col_hue, bbox_to_anchor=(1.02, 0.5), loc='center left')

    plt.suptitle(
        f"Multivariate Default Rate (%) by {col_x}, {col_hue}, and {col_facet}", 
        y=1.02,
        fontsize=16
    )
    # 6. Keep increased bottom margin and vertical spacing (hspace) between rows
    g.fig.subplots_adjust(bottom=0.30, right=0.85, hspace=0.75) 
    plt.show()

NAME_INCOME_TYPE - FLAG_OWN_REALTY - NAME_EDUCATION_TYPE¶

In [220]:
cat_cat_cat_plot(new_app_df,'NAME_INCOME_TYPE','FLAG_OWN_REALTY','NAME_EDUCATION_TYPE')
No description has been provided for this image
  • For the secondary/secondary special group, the number of payment on time are higher for those owning an asset as compared to those not owning an asset.
  • For the secondary/secondary special group, the number of defaulters are higher for those owning an asset as compared to those not owning an asset.
In [221]:
multivariate_decision_support_risk_plot(new_app_df,'NAME_INCOME_TYPE','FLAG_OWN_REALTY','NAME_EDUCATION_TYPE')
No description has been provided for this image
  • For incomplete higher group, the default risk rate is the highest for unemployed & not owning an asset.
  • For incomplete higher group, the default risk rate is higher for those not owning an asset across all income categories.
  • For secondary/secondary special group, the default risk rate is the highest for maternity leave & owning an asset.
  • For secondary/secondary special group, unemployed & owning an asset has a higher default risk rate as compared to not owning an asset.
  • The default risk rate for businessman is non-existant across all income categories.

CNT_CHILDREN - FLAG_OWN_CAR - NAME_EDUCATION_TYPE¶

In [222]:
cat_cat_cat_plot(new_app_df,'CNT_CHILDREN_OBJ','FLAG_OWN_CAR','NAME_EDUCATION_TYPE')
No description has been provided for this image
  • For secondary/secondary special, the number of payment on time & defaulters is the highest for 0 children.
  • For higher education, the number of defaulters and payment on time is the highest for 0 children.
In [223]:
multivariate_decision_support_risk_plot(new_app_df,'CNT_CHILDREN_OBJ','FLAG_OWN_CAR','NAME_EDUCATION_TYPE')
No description has been provided for this image
  • For academic degree, the default risk rate is higher for 1 children & not owning a car as compared to owning a car.
  • For higher education & incomplete higher group, the default risk rate is higher for not owning a car as compared to owning a car for all children categories.
  • For lower secondary group, the default risk rate is higher for owning a car as compared to not owning a car for 1 & 3+ children groups.
  • For secondary/secondary special group, the default risk rate is higher for not owning a car as compared to owning a car for all children groups.

Categorical - cateogrical - continuos¶

In [47]:
def cat_cat_cont_plot(df, col_x, col_hue, col_y, target_col='TARGET'):
    # 1. Use catplot with kind="box" and set 'col' to target_col for faceting
    g = sns.catplot(
        data=df,
        x=col_x,
        y=col_y,
        hue=col_hue,
        col=target_col,    # CRITICAL CHANGE: Facet by TARGET (0 and 1)
        kind="box",  
        col_wrap=2,        # Ensures the two TARGET plots sit side-by-side
        aspect=1.5,
        height=6,
        legend_out=True,
        sharey=True        # Ensure both plots share the same Y-axis scale
    )

    # 2. Clean up the plot appearance
    g.set_axis_labels(col_x, f"Distribution of {col_y}")
    g.set_titles(col_template='Target = {col_name}') # Label facets as Target = 0 and Target = 1
    
    # Set X-tick labels
    for ax in g.axes.flat:
        ax.tick_params(labelbottom=True) 
        if len(ax.get_xticks()) > 0:
            labels = [t.get_text() for t in ax.get_xticklabels()]
            ax.set_xticklabels(labels, rotation=45, ha='right')

    # Add legend and title
    #g.add_legend(title=col_hue, bbox_to_anchor=(1.02, 0.5), loc='center left')
    plt.suptitle(
        f"Distribution of {col_y} by {col_x}, {col_hue}, and {target_col}", 
        y=1.02, 
        fontsize=16
    )
    g.fig.subplots_adjust(bottom=0.30, right=0.85) 
    plt.show()

NAME_INCOME_TYPE - FLAG_OWN_REALTY - AMT_CREDIT¶

In [224]:
cat_cat_cont_plot(new_app_df,'NAME_INCOME_TYPE','FLAG_OWN_REALTY','AMT_CREDIT')
No description has been provided for this image
  • The median for unemployed regardless if they own an asset or not is higher as compared to unemployed.
  • The median for maternity leave & owning an asset for payment on time is lower compared to those defaulting.
In [225]:
multivariate_decision_support_risk_plot(new_app_df,'NAME_INCOME_TYPE','FLAG_OWN_REALTY','CREDIT_BIN_FREQ')
No description has been provided for this image
  • For Q1_Low Group, the default risk rate for working & not owning an asset is higher as compared to owning an asset.
  • For Q2_MidLow Group, the default risk rate for unemployed, state servant, maternity leave & commercial associate while owning an asset is higher as compared to not owning an asset.
  • For Q3_MidHigh Group, the default risk rate for unemployed & owning an asset is higher as compared to not owning an asset.
  • For Q4_High Group, the default risk rate for commercial associate & working while owning an asset is higher as compared to not owning an asset.
  • The default risk rate for businessman is non-existant across all groups.

NAME_EDUCATION_TYPE - FLAG_OWN_CAR - AMT_ANNUITY¶

In [226]:
cat_cat_cont_plot(new_app_df,'NAME_EDUCATION_TYPE','FLAG_OWN_CAR','AMT_ANNUITY')
No description has been provided for this image
  • The median is somewhat similar.
In [227]:
multivariate_decision_support_risk_plot(new_app_df,'NAME_EDUCATION_TYPE','FLAG_OWN_CAR','ANNUITY_BIN_FREQ')
No description has been provided for this image
  • For Q1_Low Group, the default risk rate is higher for higher education. incomplete higher, lower secondary & secondary/secondary special while not owning a car.
  • For Q2_MidLow Group, the default risk rate is higher for higher education,incomplete higher,secondary/secondary special while not owning a car as compared to owning a car.
  • For Q3_MidHigh Group, the default risk rate is higher for higher education,incomplete higher,lower secondary, secondary/secondary special while not owning a car compared to owning a car.
  • For Q4_High Group, the default risk rate is higher for academic degree,higher education,lower secondary, secondary/secondary special while not owning a car compared to owning a car.

Continous-continuos columns¶

  • The below colums are considered.
    • AMT_CREDIT
    • AMT_ANNUITY
    • AMT_GOODS_PRICE
    • AMT_INCOME_TOTAL
In [71]:
#defining function for continious-continous columns
def cont_cont_plot(df, var1, var2):
    
    plt.figure(figsize=(15,8))

    # --- Subplot 1: DEFAULTERS (df1) ---
    plt.subplot(1, 2, 1)
    # Pass the column names as strings (var1, var2) to sns.countplot
    ax1 = sns.scatterplot(data=df1, x=var1, y=var2)
    # Pass the column names as strings for the title and labels
    plt.title(f"Correlation of {var1} by {var2} for Defaulters")
    plt.xlabel(var1)
    plt.ylabel(var2)
    
    # --- Subplot 2: NON-DEFAULTERS (df0) ---
    plt.subplot(1, 2, 2)
    ax2 = sns.scatterplot(data=df0, x=var1, y=var2)
    plt.title(f"Correlation of {var1} by {var2} for Others")
    plt.xlabel(var1)
    plt.ylabel(var2) 

    # Crucial for preventing subplot titles and labels from overlapping
    plt.tight_layout()
    plt.show()
In [63]:
#function to calc correlation
def calc_correlation(df,col1,col2):
    # Correlation for Repayers (T=0)
    corr_repayers = df0[col1].corr(df0[col2])   
    # Correlation for Defaulters (T=1)
    corr_defaulters = df1[col1].corr(df1[col2])

    print(f"--- Correlation for {col1} vs. {col2} ---")
    print(f"Repayers (T=0) Correlation: {corr_repayers:.4f}")
    print(f"Defaulters (T=1) Correlation: {corr_defaulters:.4f}")

    return corr_repayers, corr_defaulters
In [72]:
cont_cont_plot(new_app_df,'AMT_CREDIT','AMT_ANNUITY')
No description has been provided for this image
In [ ]:
rep_corr, def_corr = calc_correlation(new_app_df,'AMT_CREDIT', 'AMT_ANNUITY')
  • In both graphs, AMT_REDIT is strongly positively correlated to AMT_ANNUITY.
In [73]:
cont_cont_plot(new_app_df,'AMT_CREDIT','AMT_GOODS_PRICE')
No description has been provided for this image
In [ ]:
rep_corr, def_corr = calc_correlation(new_app_df,'AMT_CREDIT', 'AMT_GOODS_PRICE')
  • In both graphs, AMT_CREDIT is strongly correlated to AMT_GOODS_PRICE.
In [74]:
cont_cont_plot(new_app_df,'AMT_CREDIT','AMT_INCOME_TOTAL')
No description has been provided for this image
  • There is no correlation between AMT_CREDIT & AMT_INCOME_TOTAL in both graphs.

4.3 Correlation¶

In [75]:
# split data by target 
df0 = new_app_df[new_app_df['TARGET'] == 0]
df1 = new_app_df[new_app_df['TARGET'] == 1]

#compute corr
corr0 = df0.corr(numeric_only=True)
corr1 = df1.corr(numeric_only=True)

def get_top_corr(corr, n=10):
    # take upper triangle only (no duplicates)
    c = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    c = c.stack().sort_values(ascending=False)
    return c.head(n)

top10_target1 = get_top_corr(corr1, 10)
top10_target0 = get_top_corr(corr0, 10)

print("\nTOP 10 CORRELATIONS — Target = 1")
print(top10_target1)

print("\nTOP 10 CORRELATIONS — Target = 0")
print(top10_target0)
TOP 10 CORRELATIONS — Target = 1
OBS_30_CNT_SOCIAL_CIRCLE  OBS_60_CNT_SOCIAL_CIRCLE    0.998270
BASEMENTAREA_AVG          BASEMENTAREA_MEDI           0.998252
COMMONAREA_AVG            COMMONAREA_MEDI             0.998204
NONLIVINGAPARTMENTS_AVG   NONLIVINGAPARTMENTS_MEDI    0.998116
YEARS_BUILD_AVG           YEARS_BUILD_MEDI            0.998115
FLOORSMIN_AVG             FLOORSMIN_MEDI              0.997826
LIVINGAPARTMENTS_AVG      LIVINGAPARTMENTS_MEDI       0.997721
NONLIVINGAPARTMENTS_MODE  NONLIVINGAPARTMENTS_MEDI    0.997076
ENTRANCES_AVG             ENTRANCES_MEDI              0.996697
LIVINGAREA_AVG            LIVINGAREA_MEDI             0.996050
dtype: float64

TOP 10 CORRELATIONS — Target = 0
YEARS_BUILD_AVG           YEARS_BUILD_MEDI            0.998523
OBS_30_CNT_SOCIAL_CIRCLE  OBS_60_CNT_SOCIAL_CIRCLE    0.998510
FLOORSMIN_AVG             FLOORSMIN_MEDI              0.997244
ENTRANCES_AVG             ENTRANCES_MEDI              0.996913
ELEVATORS_AVG             ELEVATORS_MEDI              0.996747
COMMONAREA_AVG            COMMONAREA_MEDI             0.996119
LIVINGAREA_AVG            LIVINGAREA_MEDI             0.995763
APARTMENTS_AVG            APARTMENTS_MEDI             0.995321
BASEMENTAREA_AVG          BASEMENTAREA_MEDI           0.994160
LIVINGAPARTMENTS_AVG      LIVINGAPARTMENTS_MEDI       0.993889
dtype: float64

5. PREVIOUS_APPLICATION¶

5.1 Read the Data File¶

In [76]:
# load dataset
prev_df = pd.read_csv('previous_application.csv')
In [ ]:
#read first 5 rows
prev_df.head(5)
In [ ]:
#read last 5 rows
prev_df.tail(5)

5.2 Inspect the Data Frame¶

In [ ]:
prev_df.shape
In [ ]:
prev_df.columns.values
In [ ]:
prev_df.dtypes
In [ ]:
pd.options.display.float_format = '{:,.2f}'.format
prev_df.describe()
  • AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_DOWN_PAYMENT & AMT_GOODS_PRICE have outliers.

5.3 Data Imbalance Checking¶

In [77]:
#Count null values per column
null_count = prev_df.isnull().sum()
#Total number of rows
total_count = prev_df.shape[0]
# Calculate percentage of nulls per column and round to 2 decimal places
null_pct2 = ((null_count / total_count) * 100).round(2)
#Filter only columns with missing values (makes plot cleaner)
null_pct2 = null_pct2[null_pct2 > 0].sort_values(ascending=False)

#Plot using pointplot
plt.figure(figsize=(16,5))
sns.pointplot(x=null_pct2.index, y=null_pct2.values)
plt.xticks(rotation=90)
plt.xlabel("Columns")
plt.ylabel("Missing Values (%)")
plt.title("Percentage of Null Values per Column")
plt.show()
No description has been provided for this image

6.Data Cleaning¶

6.1 Missing Value Imputation¶

In [121]:
# create a copy of base data
new_prev_df = prev_df.copy()
In [ ]:
new_prev_df.head(5)
In [ ]:
#check for missing values in percentage for each col
null_pct2.sort_values(ascending=True)
In [122]:
#get column names with more than 50% missing
cols_50 = null_pct2[null_pct2 >= 50].index.tolist()
cols_50
Out[122]:
['RATE_INTEREST_PRIMARY',
 'RATE_INTEREST_PRIVILEGED',
 'AMT_DOWN_PAYMENT',
 'RATE_DOWN_PAYMENT']
In [ ]:
#remove the colums with more than 50% missing
In [123]:
new_prev_df.drop(cols_50, axis=1, inplace=True)
new_prev_df.shape
Out[123]:
(1670214, 33)
  • We will also drop NAME_TYPE_SUITE since it has null % value close to 50%.
In [124]:
new_prev_df.drop('NAME_TYPE_SUITE', axis=1, inplace=True)
new_prev_df.shape
Out[124]:
(1670214, 32)
In [ ]:
new_prev_df['NFLAG_INSURED_ON_APPROVAL'].value_counts()
  • We will impute the null values of NFLAG_INSURED_ON_APPROVAL with NA eventhough the null & is very close to 50%. NA is chosen because missingness in this column is likely non-random(representing 'Not Applicable').
  • First we will convert its data type to object.
In [125]:
new_prev_df['NFLAG_INSURED_ON_APPROVAL'].astype(object)
Out[125]:
0         0.00
1         1.00
2         1.00
3         1.00
4          NaN
          ... 
1670209   0.00
1670210   0.00
1670211   0.00
1670212   1.00
1670213   0.00
Name: NFLAG_INSURED_ON_APPROVAL, Length: 1670214, dtype: object
In [126]:
#fill null values with NA (third category)
new_prev_df['NFLAG_INSURED_ON_APPROVAL'].fillna('NA', inplace=True)
In [128]:
# check to see any null valus remaining
new_prev_df['NFLAG_INSURED_ON_APPROVAL'].isna().sum()
Out[128]:
np.int64(0)
  • We will impute PRODUCT_COMBINATION with mode since it is a categorical data
In [129]:
new_prev_df['PRODUCT_COMBINATION'].fillna(new_prev_df['PRODUCT_COMBINATION'].mode()[0], inplace=True)
In [130]:
#check to see any null values remaining
new_prev_df['PRODUCT_COMBINATION'].isna().sum()
Out[130]:
np.int64(0)
  • Next, we will get a list of columns with values of 365243.
In [131]:
PLACEHOLDER_CODE = 365243

# 1. Create a boolean Series: True if the column contains the placeholder
#    (Using .any() checks if ANY value in the column is equal to the code)
cols_with_placeholder = (new_prev_df == PLACEHOLDER_CODE).any()

# 2. Filter the index to get the list of column names
placeholder_cols = cols_with_placeholder[cols_with_placeholder].index.tolist()
placeholder_cols
Out[131]:
['SK_ID_CURR',
 'DAYS_FIRST_DRAWING',
 'DAYS_FIRST_DUE',
 'DAYS_LAST_DUE_1ST_VERSION',
 'DAYS_LAST_DUE',
 'DAYS_TERMINATION']
  • We will also convert 365243 which represents a missing value to NAN in DAYS_FIRST_DUE ,DAYS_TERMINATION , DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE
In [132]:
cols_to_convert = ['DAYS_FIRST_DRAWING','DAYS_FIRST_DUE','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION']

for col in cols_to_convert:
    # Action: Modifies the column directly and returns None
    new_prev_df[col].replace(365243, np.nan, inplace=True)
  • We will impute AMT_ANNUITY, CNT_PAYMENT, AMT_GOODS_PRICE, DAYS_FIRST_DUE, DAYS_TERMINATION, DAYS_FIRST_DRAWING, DAYS_LAST_DUE_1ST_VERSION & DAYS_LAST_DUE with the median since they are numerical columns.
In [134]:
num_cols = ['AMT_ANNUITY', 'CNT_PAYMENT', 'AMT_GOODS_PRICE', 'DAYS_FIRST_DUE', 
            'DAYS_TERMINATION', 'DAYS_FIRST_DRAWING','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE']

for col in num_cols:
    # Ensure you are calculating the median *excluding* the NaNs
    median_val = new_prev_df[col].median()
    new_prev_df[col].fillna(median_val, inplace=True)
In [135]:
#check entire dataframe for null values 
new_prev_df.isnull().sum()
Out[135]:
SK_ID_PREV                     0
SK_ID_CURR                     0
NAME_CONTRACT_TYPE             0
AMT_ANNUITY                    0
AMT_APPLICATION                0
AMT_CREDIT                     1
AMT_GOODS_PRICE                0
WEEKDAY_APPR_PROCESS_START     0
HOUR_APPR_PROCESS_START        0
FLAG_LAST_APPL_PER_CONTRACT    0
NFLAG_LAST_APPL_IN_DAY         0
NAME_CASH_LOAN_PURPOSE         0
NAME_CONTRACT_STATUS           0
DAYS_DECISION                  0
NAME_PAYMENT_TYPE              0
CODE_REJECT_REASON             0
NAME_CLIENT_TYPE               0
NAME_GOODS_CATEGORY            0
NAME_PORTFOLIO                 0
NAME_PRODUCT_TYPE              0
CHANNEL_TYPE                   0
SELLERPLACE_AREA               0
NAME_SELLER_INDUSTRY           0
CNT_PAYMENT                    0
NAME_YIELD_GROUP               0
PRODUCT_COMBINATION            0
DAYS_FIRST_DRAWING             0
DAYS_FIRST_DUE                 0
DAYS_LAST_DUE_1ST_VERSION      0
DAYS_LAST_DUE                  0
DAYS_TERMINATION               0
NFLAG_INSURED_ON_APPROVAL      0
dtype: int64
  • AMT_CREDIT has a null value, so we will impute it with median.
In [136]:
new_prev_df['AMT_CREDIT'].fillna(new_prev_df['AMT_CREDIT'].median(), inplace=True)
In [137]:
#recheck entire dataframe for null values 
new_prev_df.isnull().sum()
Out[137]:
SK_ID_PREV                     0
SK_ID_CURR                     0
NAME_CONTRACT_TYPE             0
AMT_ANNUITY                    0
AMT_APPLICATION                0
AMT_CREDIT                     0
AMT_GOODS_PRICE                0
WEEKDAY_APPR_PROCESS_START     0
HOUR_APPR_PROCESS_START        0
FLAG_LAST_APPL_PER_CONTRACT    0
NFLAG_LAST_APPL_IN_DAY         0
NAME_CASH_LOAN_PURPOSE         0
NAME_CONTRACT_STATUS           0
DAYS_DECISION                  0
NAME_PAYMENT_TYPE              0
CODE_REJECT_REASON             0
NAME_CLIENT_TYPE               0
NAME_GOODS_CATEGORY            0
NAME_PORTFOLIO                 0
NAME_PRODUCT_TYPE              0
CHANNEL_TYPE                   0
SELLERPLACE_AREA               0
NAME_SELLER_INDUSTRY           0
CNT_PAYMENT                    0
NAME_YIELD_GROUP               0
PRODUCT_COMBINATION            0
DAYS_FIRST_DRAWING             0
DAYS_FIRST_DUE                 0
DAYS_LAST_DUE_1ST_VERSION      0
DAYS_LAST_DUE                  0
DAYS_TERMINATION               0
NFLAG_INSURED_ON_APPROVAL      0
dtype: int64

6.2 Data Type Correction¶

In [138]:
count_cols = ['DAYS_FIRST_DRAWING','DAYS_FIRST_DUE','DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION']

for col in count_cols:
    new_prev_df[col] = new_prev_df[col].astype('int64')

6.3 Data Standardization¶

  • There are some columns having negative values. We will convert them to positive using their absolute values.
  • The columns are DAYS_FIRST_DRAWING, DAYS_FIRST_DUE, DAYS_LAST_DUE_1ST_VERSION, DAYS_LAST_DUE, DAYS_TERMINATION, DAYS_DECISION.
In [ ]:
new_prev_df.describe()
In [139]:
pos_days_cols = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION','DAYS_DECISION','SELLERPLACE_AREA']
# concvert to absolute values
for col in pos_days_cols:
    new_prev_df[col]= new_prev_df[col].abs()

6.4 Outlier Analysis¶

  • These are the columns we will perform outlier analysis on: AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, AMT_GOODS_PRICE
  • Firstly we will perform IQR Rule followed by visualizations using boxlpots for each column.
In [140]:
outlier_cols = ['AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE']

def find_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower) | (df[col] > upper)]
    return len(outliers), lower, upper

# Summary
outlier_summary = []
for col in outlier_cols:
    n_out, low, up = find_outliers(new_prev_df, col)
    pct = n_out / len(new_prev_df) * 100
    outlier_summary.append({
        'Column': col,
        'Outliers': n_out,
        '%': round(pct, 2),
        'Lower': round(low, 2),
        'Upper': round(up, 2)
    })
    
summary_df = pd.DataFrame(outlier_summary)
print(summary_df)
            Column  Outliers     %       Lower      Upper
0      AMT_ANNUITY    162620  9.74   -6,368.30  30,739.42
1  AMT_APPLICATION    208019 12.45 -223,740.00 422,820.00
2       AMT_CREDIT    179989 10.78 -264,226.50 504,805.50
3  AMT_GOODS_PRICE    236035 14.13 -101,857.50 349,762.50
In [ ]:
fig, axes = plt.subplots(2, 2, figsize=(8,6))
axes = axes.ravel()

for i, col in enumerate(outlier_cols):
    sns.boxplot(data=new_prev_df, y=col, ax=axes[i], color='lightblue')
    axes[i].set_title(col)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
  • We will also perform capping on AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT & AMT_GOODS_PRICE
In [141]:
#capping
# List of columns to cap
cap_cols = [
    'AMT_ANNUITY','AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE'
]

# Cap using IQR upper bound
for col in cap_cols:
    Q1 = new_prev_df[col].quantile(0.25)
    Q3 = new_prev_df[col].quantile(0.75)
    IQR = Q3 - Q1
    upper = Q3 + 1.5 * IQR
    new_prev_df[col] = new_prev_df[col].clip(upper=upper)
In [ ]:
fig, axes = plt.subplots(2, 2, figsize=(8,6))
axes = axes.ravel()

for i, col in enumerate(outlier_cols):
    sns.boxplot(data=new_prev_df, y=col, ax=axes[i], color='lightblue')
    axes[i].set_title(col)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

7.MERGING DATA SETS¶

7.1 Merging the Data Sets¶

  • We will perform Data Flattening. Its purpose is to summarize all of an applicant's past loan applications into a single row of features, making the rich historical data usable in the main application table.
In [142]:
### NOTE: THIS TAKES ABOUT 20 MINS TO RUN 

import pandas as pd
import numpy as np

import pandas as pd
import numpy as np

# ============================================
# 1. Columns you specifically want
# ============================================

prev_keep = [
    'SK_ID_CURR',
    'AMT_ANNUITY',
    'AMT_APPLICATION',
    'AMT_CREDIT',
    'AMT_GOODS_PRICE',
    'CNT_PAYMENT',
    'DAYS_DECISION',
    'SELLERPLACE_AREA',
    'NAME_CONTRACT_STATUS',
    'CODE_REJECT_REASON',
    'NAME_CASH_LOAN_PURPOSE',
    'PRODUCT_COMBINATION',
    'NFLAG_INSURED_ON_APPROVAL'
]

prev_small = new_prev_df[prev_keep].copy()

# ============================================
# 2. NUMERIC aggregation → median
# ============================================

num_cols = [
    'AMT_ANNUITY',
    'AMT_APPLICATION',
    'AMT_CREDIT',
    'AMT_GOODS_PRICE',
    'CNT_PAYMENT',
    'DAYS_DECISION',
    'SELLERPLACE_AREA'
]

prev_num_agg = (
    prev_small.groupby('SK_ID_CURR')[num_cols]
    .median()
    .reset_index()
)

# ============================================
# 3. CATEGORICAL aggregation → mode
# ============================================

cat_cols = [
    'NAME_CONTRACT_STATUS',
    'CODE_REJECT_REASON',
    'NAME_CASH_LOAN_PURPOSE',
    'PRODUCT_COMBINATION',
    'NFLAG_INSURED_ON_APPROVAL'
]

def mode_or_unknown(x):
    m = x.mode()
    return m.iloc[0] if not m.empty else 'UNKNOWN'

prev_cat_agg = (
    prev_small.groupby('SK_ID_CURR')[cat_cols]
    .agg(mode_or_unknown)
    .reset_index()
)

# ============================================
# 4. Combine numeric + categorical
# ============================================

prev_agg = prev_num_agg.merge(prev_cat_agg, on='SK_ID_CURR', how='left')

# ============================================
# 5. Rename columns to avoid collisions
# ============================================

prev_agg.rename(columns=lambda c: f"PREV_{c}" if c != 'SK_ID_CURR' else c,
                inplace=True)

# ============================================
# 6. Merge with application data
# ============================================

merged_df = new_app_df.merge(prev_agg, on='SK_ID_CURR', how='left')

print("Final merged_df shape:", merged_df.shape)
Final merged_df shape: (307511, 94)
In [ ]:
# check the records to see the new merged df
merged_df.head(5)
In [143]:
#Count null values per column
null_count = merged_df.isnull().sum()
#Total number of rows
total_count = merged_df.shape[0]
# Calculate percentage of nulls and round to 2 decimal places
null_pct3 = ((null_count / total_count) * 100).round(2)
#Filter only columns with missing values (makes plot cleaner)
null_pct3 = null_pct3[null_pct3 > 0].sort_values(ascending=False)

#Plot using pointplot
plt.figure(figsize=(16,5))
sns.pointplot(x=null_pct3.index, y=null_pct3.values)
plt.xticks(rotation=90)
plt.xlabel("Columns")
plt.ylabel("Missing Values (%)")
plt.title("Percentage of Null Values per Column")
plt.show()
No description has been provided for this image
In [ ]:
#check for missing value in percentage
null_pct3.sort_values(ascending=True)
  • There are many 'PREV' columns with NAN.These nan appear because the customer has no previous loans. We will use 0 for these numerical columns.
In [144]:
prev_cols = ['PREV_DAYS_DECISION','PREV_AMT_CREDIT','PREV_AMT_APPLICATION','PREV_CNT_PAYMENT','PREV_AMT_ANNUITY','PREV_AMT_GOODS_PRICE','PREV_SELLERPLACE_AREA']

for col in prev_cols:
    merged_df[col].fillna(0, inplace=True)
In [145]:
merged_df['YEARS_EMPLOYED'].fillna(merged_df['YEARS_EMPLOYED'].median(),inplace=True)
In [146]:
cat_col = ['PREV_NAME_CONTRACT_STATUS','PREV_NFLAG_INSURED_ON_APPROVAL','PREV_CODE_REJECT_REASON','PREV_NAME_CASH_LOAN_PURPOSE','PREV_PRODUCT_COMBINATION']

for col in cat_col:
    merged_df[col].fillna('NA', inplace=True)
In [147]:
#check for remaining missing values
missing_count = merged_df.isnull().sum()
missing_data = missing_count[missing_count > 0]
missing_data
Out[147]:
CODE_GENDER    4
dtype: int64
In [148]:
merged_df['CODE_GENDER'].fillna(merged_df['CODE_GENDER'].mode()[0],inplace=True)
In [149]:
#check again
missing_count = merged_df.isnull().sum()
missing_data = missing_count[missing_count > 0]
missing_data
Out[149]:
Series([], dtype: int64)
  • We will also perform feature binning for the numerical columns that we will use for bivariate analysis, categorical-continous columns. This is to make our visualization cleaner.

PREV_ANNUITY¶

In [150]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
merged_df['PREV_ANNUITY_BIN_FREQ'] = pd.qcut(
    merged_df['PREV_AMT_ANNUITY'], 
    q=N_BINS, 
    # Optional: name your categories (e.g., Low, Medium, High)
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

# Check the results: notice the counts are much closer to each other
merged_df['PREV_ANNUITY_BIN_FREQ'].value_counts().sort_index()
Out[150]:
PREV_ANNUITY_BIN_FREQ
Q1_Low         76878
Q2_MidLow     125809
Q3_MidHigh     27946
Q4_High        76878
Name: count, dtype: int64

PREV_APPLICATION¶

In [282]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
merged_df['PREV_APPLICATION_BIN_FREQ'] = pd.qcut(
    merged_df['PREV_AMT_APPLICATION'], 
    q=N_BINS, 
    # Optional: name your categories (e.g., Low, Medium, High)
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

# Check the results: notice the counts are much closer to each other
merged_df['PREV_APPLICATION_BIN_FREQ'].value_counts().sort_index()
Out[282]:
PREV_APPLICATION_BIN_FREQ
Q1_Low        76889
Q2_MidLow     76875
Q3_MidHigh    76871
Q4_High       76876
Name: count, dtype: int64

PREV_CREDIT¶

In [152]:
# Define the number of quantiles/bins (e.g., 4 quartiles)
N_BINS = 4 

# Create the new binned column
merged_df['PREV_CREDIT_BIN_FREQ'] = pd.qcut(
    merged_df['PREV_AMT_CREDIT'], 
    q=N_BINS, 
    # Optional: name your categories (e.g., Low, Medium, High)
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' # Handles cases where values are identical at the cutoff
)

# Check the results: notice the counts are much closer to each other
merged_df['PREV_CREDIT_BIN_FREQ'].value_counts().sort_index()
Out[152]:
PREV_CREDIT_BIN_FREQ
Q1_Low        76884
Q2_MidLow     76873
Q3_MidHigh    76876
Q4_High       76878
Name: count, dtype: int64
In [ ]:
unique_count = merged_df['PREV_CNT_PAYMENT'].nunique()
print(f"Unique values in PREV_CNT_PAYMENT: {unique_count}")

PREV_DAYS_DECISION¶

In [153]:
# Create the new binned column using pd.qcut
# This splits the data into 4 equal-sized groups based on the count of previous payments.
N_BINS = 4 

merged_df['PREV_DAYS_DECISION_BIN_FREQ'] = pd.qcut(
    merged_df['PREV_DAYS_DECISION'], 
    q=N_BINS, # Splits the data into 4 bins (quartiles)
    # 4 bins require exactly 4 labels
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' 
)

merged_df['PREV_DAYS_DECISION_BIN_FREQ'].value_counts().sort_index()
[Binned Feature Result]
PREV_DAYS_DECISION_BIN_FREQ
Q1_Low        77099
Q2_MidLow     76673
Q3_MidHigh    76914
Q4_High       76825
Name: count, dtype: int64

PREV_GOODS_PRICE¶

In [154]:
# Create the new binned column using pd.qcut
# This splits the data into 4 equal-sized groups based on the count of previous payments.
N_BINS = 4 

merged_df['PREV_GOODS_PRICE_BIN_FREQ'] = pd.qcut(
    merged_df['PREV_AMT_GOODS_PRICE'], 
    q=N_BINS, # Splits the data into 4 bins (quartiles)
    # 4 bins require exactly 4 labels
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    duplicates='drop' 
)

merged_df['PREV_GOODS_PRICE_BIN_FREQ'].value_counts().sort_index()
[Binned Feature Result]
Out[154]:
PREV_GOODS_PRICE_BIN_FREQ
Q1_Low         76878
Q2_MidLow     127866
Q3_MidHigh     25896
Q4_High        76871
Name: count, dtype: int64

PREV_PAYMENT¶

In [155]:
N_BINS = 4 

# --- SIMPLIFIED SOLUTION: Using pd.cut for fixed-width binning ---
# Since pd.qcut was unstable, pd.cut is used to segment the feature based on value range.
# This operation is stable and simple, avoiding all the previous errors.

merged_df['PREV_PAYMENT_BIN_FREQ'] = pd.cut(
    merged_df['PREV_CNT_PAYMENT'],
    bins=N_BINS, # Automatically calculates 4 equally spaced bins between min and max
    labels=['Q1_Low', 'Q2_MidLow', 'Q3_MidHigh', 'Q4_High'], 
    include_lowest=True # Ensures the minimum value is included in the first bin
)

# Check the results
merged_df['PREV_PAYMENT_BIN_FREQ'].value_counts().sort_index()
[Binned Feature Result (pd.cut)]
PREV_PAYMENT_BIN_FREQ
Q1_Low        277787
Q2_MidLow      26932
Q3_MidHigh      2063
Q4_High          729
Name: count, dtype: int64

7.2 Segmented Univariate Analysis¶

Univariate Categorical Analysis¶

  • We will perform this analysis on these columns
    • PREV_NAME_CONTRACT_STATUS, PREV_NFLAG_INSURED_ON_APPROVAL, PREV_CODE_REJECT_REASON, PREV_NAME_CASH_LOAN_PURPOSE, PREV_PRODUCT_COMBINATION

PREV_NAME_CONTRACT_STATUS¶

In [228]:
cat_plot(merged_df,'PREV_NAME_CONTRACT_STATUS')
No description has been provided for this image
  • Approved contract has the highest defaulters & payment on time.
In [274]:
univariate_decision_support_risk_plot(merged_df,'PREV_NAME_CONTRACT_STATUS')
No description has been provided for this image
  • Refused previous loans have the highest default risk rate followed by cancelled previous loans.
  • NA (no previous loans) has the lowest default risk rate followed by approved loans.

PREV_NFLAG_INSURED_ON_APPROVAL¶

In [229]:
cat_plot(merged_df,'PREV_NFLAG_INSURED_ON_APPROVAL')
No description has been provided for this image
  • People with no previous loan status has the highest defaulters and those paying on time.
In [230]:
univariate_decision_support_risk_plot(merged_df,'PREV_NFLAG_INSURED_ON_APPROVAL')
No description has been provided for this image
  • People with no previous loans status have the highest default risk rate.
  • People with 1 previous loan status have the lowest default rate.

PREV_CODE_REJECT_REASON¶

In [231]:
cat_plot(merged_df,'PREV_CODE_REJECT_REASON')
No description has been provided for this image
  • The number of defaulters and those paying on time is the highest for XAP.
In [232]:
univariate_decision_support_risk_plot(merged_df,'PREV_CODE_REJECT_REASON')
No description has been provided for this image
  • SCOFR has the highest default risk rate.
  • NA has the lowest default risk rate.

PREV_NAME_CASH_LOAN_PURPOSE¶

In [233]:
cat_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE')
No description has been provided for this image
  • XAP has the highest number of defaulters and those paying on time.
In [234]:
univariate_decision_support_risk_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE')
No description has been provided for this image
  • Cash loan for water supply has the highest default risk rate.
  • The default rate for refusal to name the goal, hobby & money for a third person is non-existant.
  • The default rate is the lowest for business development.

PREV_PRODUCT_COMBINATION¶

In [235]:
cat_plot(merged_df,'PREV_PRODUCT_COMBINATION')
No description has been provided for this image
  • The number of defaulters & those paying on time is the highest for POS household with interest.
In [236]:
univariate_decision_support_risk_plot(merged_df,'PREV_PRODUCT_COMBINATION')
No description has been provided for this image
  • The highest default risk rate is for Cash X-Sell: high.
  • The lowest default risk rate is for POS industry wihtout interest.

Univariate Numerical Analysis¶

  • We will perform this analysis on these columns:
    • PREV_AMT_ANNUITY, PREV_AMT_APPLICATION, PREV_AMT_CREDIT, PREV_AMT_GOODS_PRICE , PREV_CNT_PAYMENT, PREV_DAYS_DECISION, PREV_SELLERPLACE_AREA
In [237]:
df0 = merged_df[merged_df['TARGET'] == 0]
df1 = merged_df[merged_df['TARGET'] == 1]

PREV_AMT_ANNUITY¶

In [238]:
cont_plot(merged_df,'PREV_AMT_ANNUITY')
No description has been provided for this image
  • The risk density is the highest among those with previous amount annuity between 0 and 5000.
  • The risk density is the lowest among those with high previous amount annuity between 10000 and 15000.
In [239]:
univariate_decision_support_risk_plot(merged_df,'PREV_ANNUITY_BIN_FREQ')
No description has been provided for this image
  • Q2_MidLow has the highest default risk rate.
  • Q4_High has the lowest default risk rate.

PREV_AMT_APPLICATION¶

In [240]:
cont_plot(merged_df,'PREV_AMT_APPLICATION')
No description has been provided for this image
  • The risk density is the highest between 0 and 50000.
  • The risk density is the lowest for previous amount application > 400000.
In [241]:
univariate_decision_support_risk_plot(merged_df,'PREV_APPLICATION_BIN_FREQ')
No description has been provided for this image
  • Q1_Low has the highest default risk rate.
  • Q4_High has the lowest default risk rate.

PREV_AMT_CREDIT¶

In [242]:
cont_plot(merged_df,'PREV_AMT_CREDIT')
No description has been provided for this image
  • The risk density is the highest between 0 and 100000.
  • The risk dnesity is the lowest for previous amount credit > 500000.
In [243]:
univariate_decision_support_risk_plot(merged_df,'PREV_CREDIT_BIN_FREQ')
No description has been provided for this image
  • Q1_Low has the highest default risk rate.
  • Q3_MidHigh has the lowest default risk rate.
In [244]:
cont_plot(merged_df,'PREV_AMT_GOODS_PRICE')
No description has been provided for this image
  • The highest risk density is around 50000.
  • The lowest risk density is between 100000 & 150000.
In [245]:
univariate_decision_support_risk_plot(merged_df,'PREV_GOODS_PRICE_BIN_FREQ')
No description has been provided for this image
  • The default risk rate is the highest for Q2_MidLow.
  • The default risk rate is the lowest for Q3_MidHigh.

PREV_CNT_PAYMENT¶

In [246]:
cont_plot(merged_df,'PREV_CNT_PAYMENT')
No description has been provided for this image
  • Client with around 12 payment has the lowest risk density.
In [247]:
univariate_decision_support_risk_plot(merged_df,'PREV_PAYMENT_BIN_FREQ')
No description has been provided for this image
  • Q3_MidHigh has the highest default risk rate.
  • Q1_Low has the lowest default risk rate.

PREV_DAYS_DECISION¶

In [248]:
cont_plot(merged_df,'PREV_DAYS_DECISION')
No description has been provided for this image
  • Around 300 days, the risk density is the highest.
  • People who recently made decision has the lowest risk density.
In [249]:
univariate_decision_support_risk_plot(merged_df,'PREV_DAYS_DECISION_BIN_FREQ')
No description has been provided for this image
  • Q1_Low has the highest default risk rate.
  • Q4_High has the lowest default risk rate.

SELLERPLACE_AREA¶

In [250]:
cont_plot(merged_df,'PREV_SELLERPLACE_AREA')
No description has been provided for this image
  • Due to nature of reading, prev_sellerplace area is not a good indicator for default risk analysis

7.3 Bivariate Analysis¶

Categorical-categorical columns¶

  • These are the pairings used:
    • PREV_NAME_CONTRACT_STATUS - NAME_HOUSING_TYPE
    • PREV_NFLAG_INSURED_ON_APPROVAL - OCCUPATION_TYPE
    • PREV_CODE_REJECT_REASON - CODE_GENDER
    • PREV_NAME_CASH_LOAN_PURPOSE - NAME_FAMILY_STATUS
In [251]:
df0 = merged_df[merged_df['TARGET']==0]
df1 = merged_df[merged_df['TARGET']==1]

PREV_NAME_CONTRACT_STATUS - NAME_HOUSING_TYPE¶

In [252]:
cat_cat_plot(merged_df,'PREV_NAME_CONTRACT_STATUS','NAME_HOUSING_TYPE')
No description has been provided for this image
  • The number of defaulters & others is the highest for those with approved house/apartment across all contract status in both graphs.
In [253]:
bivariate_decision_support_risk_plot(merged_df,'PREV_NAME_CONTRACT_STATUS','NAME_HOUSING_TYPE')
No description has been provided for this image
  • The highest default risk rate is among those who stays in a rented apartment & has their previous contract refused.
  • Among the different contract status, those in a rented apartment has the highest default risk rate.

PREV_NFLAG_INSURED_ON_APPROVAL- OCCUPATION_TYPE¶

In [254]:
cat_cat_plot(merged_df,'PREV_NFLAG_INSURED_ON_APPROVAL','OCCUPATION_TYPE')
No description has been provided for this image
  • Unkown occupation type seems to have the highest number of defautlers & payment on time across all insurance approval types.
In [255]:
bivariate_decision_support_risk_plot(merged_df,'PREV_NFLAG_INSURED_ON_APPROVAL','OCCUPATION_TYPE')
No description has been provided for this image
  • Low skill laborers have the highest default rate across all insurance approval flags.
  • Accountants hav the lowest default rate across all insurance approval flags.

PREV_CODE_REJECT_REASON - CODE_GENDER¶

In [256]:
cat_cat_plot(merged_df,'PREV_CODE_REJECT_REASON','CODE_GENDER')
No description has been provided for this image
  • Females have the higher number of defaulters & payment on time as compared to males for all different codes.
In [257]:
bivariate_decision_support_risk_plot(merged_df,'PREV_CODE_REJECT_REASON','CODE_GENDER')
No description has been provided for this image
  • Males have the higher default rate across all codes except for SYSTEM & XNA.
  • For code SYSTEM, the default rate between males & females are the same.

PREV_NAME_CASH_LOAN_PURPOSE - NAME_FAMILY_STATUS¶

In [258]:
cat_cat_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE','NAME_FAMILY_STATUS')
No description has been provided for this image
  • Married has the highest number of defaulters and payment on time.
In [259]:
bivariate_decision_support_risk_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE','NAME_FAMILY_STATUS')
No description has been provided for this image
  • Civil marriage has the highest default risk rate for most of the cash loan purposes.
  • Default rate for hobby and refusal to name the goal is non-existant.

Categorical-continuos columns¶

  • These are the columns we will perform our analysis on:
    • PREV_AMT_ANNUITY - NAME_INCOME_TYPE
    • PREV_AMT_APPLICATION - OCCUPATION_TYPE
    • PREV_AMT_CREDIT - NAME_FAMILY_STATUS
    • PREV_CNT_PAYMENT - FLAG_OWN_CAR

PREV_AMT_ANNUITY - NAME_INCOME_TYPE¶

In [260]:
cat_cont_plot(merged_df,'PREV_AMT_ANNUITY','NAME_INCOME_TYPE')
No description has been provided for this image
  • The box (IQR) for Maternity Leave is located much higher than any other category, with a median PREV_AMT_ANNUITY around 20000.
  • The Maternity leave category is barely visible, showing a near-total lack of volume in the non-defaulting population.
  • The Businessman category is virtually absent from the Defaulter plot and barely visible in the Non-Defaulter plot.
In [287]:
bivariate_decision_support_risk_plot(merged_df,'PREV_ANNUITY_BIN_FREQ','NAME_INCOME_TYPE')
No description has been provided for this image
  • Unemployed income type within Q3_MidHigh or Q4_High has the highest default risk rate.
  • For most of the income types, Q4_High seems to have the lowest default risk rate.

PREV_AMT_APPLICATION - OCCUPATION_TYPE¶

In [262]:
cat_cont_plot(merged_df,'PREV_AMT_APPLICATION','OCCUPATION_TYPE')
No description has been provided for this image
  • The boxs (IQR) overlap each other while the median are about the same in both graphs.
In [284]:
bivariate_decision_support_risk_plot(merged_df,'PREV_APPLICATION_BIN_FREQ','NAME_INCOME_TYPE')
No description has been provided for this image
  • Q2_MidLow Unemployed within Q2_MidLow, Maternity Leave within Q3_MidHigh or Q4_High has the highest default risk rate.
  • The default risk rate for businessman is non-existant.
  • Pensions have the lowest default risk rate across all application categories.

PREV_AMT_CREDIT - NAME_FAMILY_STATUS¶

In [280]:
cat_cont_plot(merged_df,'PREV_AMT_CREDIT','NAME_FAMILY_STATUS')
No description has been provided for this image
  • The boxes (IQR) overlap each other while the median are about the same in both graphs for most of the occupation types.
  • The unknown category is virtually absent from the others graph.
In [288]:
bivariate_decision_support_risk_plot(merged_df,'PREV_CREDIT_BIN_FREQ','NAME_FAMILY_STATUS')
No description has been provided for this image
  • For almost all credit categories, civil marriage has the highest default risk rate.
  • Civil marriage within Q1_Low has the highest default risk rate.
  • The default risk rate for unknown is non-existant.
  • Widow has the lowest default risk rate across all credit categories.

Multivariate Analysis¶

  • We will perform categorical-categorical-categorical types of multivariate analysis:
    • Categorical-categorical-categorical
    • Categorical-categorical-continous

Categorical - cateogorical - cateogorical¶

PREV_NAME_CONTRACT_STATUS - NAME_HOUSING_TYPE - NAME_INCOME_TYPE¶

In [159]:
cat_cat_cat_plot(merged_df,'PREV_NAME_CONTRACT_STATUS','NAME_HOUSING_TYPE','NAME_INCOME_TYPE')
No description has been provided for this image
  • Working group,state servant group,commercial associate & pensioner group with their previou contract status approved has the highest number of payment on time & defaulters
In [285]:
multivariate_decision_support_risk_plot(merged_df,'PREV_NAME_CONTRACT_STATUS','NAME_HOUSING_TYPE','NAME_INCOME_TYPE')
No description has been provided for this image
  • For the Commercial Associate group, the Rented Apartment housing type shows the highest default risk rate within the categories of Approved, Cancelled, and NA previous contract status
  • For the Commerical Associate group, the NA contract status shows the lowest default risk rate within all the contract status types.
  • For pensioner group, the NA contract status shows the highest default risk within all the contract status types.
  • For the state servent group, the house/apartment housing type with unused offer shows the highest default risk rate within all contract stauts types.
  • For maternity leave, the house/apartment housing type with approved contract status shows the highest defualt risk rate across all graphs.
  • Similar to maternity leave, municipal apartment housing type with approved contract status & house/apartment with canceled status shows the highest defualt risk rate across all graphs.
  • The default risk rate for businessman group is non-existant.

PREV_NAME_CASH_LOAN_PURPOSE - NAME_FAMILY_STATUS- FLAG_OWN_REALTY¶

In [264]:
cat_cat_cat_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE','NAME_FAMILY_STATUS','FLAG_OWN_REALTY')
No description has been provided for this image
  • For those owing & not owning an asset, married family status shows the highest number of payment on tine across the XAP, XNA & NA cash loan purpose categories.
In [265]:
multivariate_decision_support_risk_plot(merged_df,'PREV_NAME_CASH_LOAN_PURPOSE','NAME_FAMILY_STATUS','FLAG_OWN_REALTY')
No description has been provided for this image
  • For those not owning an asset, civil marriage housing type within the buying a holiday home/land category shows the highest default risk rate.
  • For those owning an asset, separated civil marriage housing type within the car repairs category shows the highest default risk rate.

PREV_AMT_ANNUITY - NAME_INCOME_TYPE - FLAG_OWN_REALTY¶

In [289]:
cat_cat_cat_plot(merged_df,'PREV_ANNUITY_BIN_FREQ','NAME_INCOME_TYPE','FLAG_OWN_REALTY')
No description has been provided for this image
  • For the groups owning & not owning an asset, working category has the highest payment on time within all annuity categories.
In [290]:
multivariate_decision_support_risk_plot(merged_df,'PREV_ANNUITY_BIN_FREQ','NAME_INCOME_TYPE','FLAG_OWN_REALTY')
No description has been provided for this image
  • For those not owning an asset. unemployed income type within Q2_MidLow & Q3_MidHigh has the highest default risk rate.
  • For those owning an asset, unemployed income type within Q2_MidLow & Q3_MidHigh has the highest default risk rate.

PREV_AMT_CREDIT - NAME_FAMILY_STATUS - FLAG_OWN_CAR¶

In [291]:
cat_cat_cat_plot(merged_df,'PREV_CREDIT_BIN_FREQ','NAME_FAMILY_STATUS','FLAG_OWN_CAR')
No description has been provided for this image
  • For those not owning & owning a car, married within the Q4_High category shows the highest number of payment on time.
In [292]:
multivariate_decision_support_risk_plot(merged_df,'PREV_CREDIT_BIN_FREQ','NAME_FAMILY_STATUS','FLAG_OWN_CAR')
No description has been provided for this image
  • For those not owning an asset, civil marriage category within Q1_Low shows the highest default rate.
  • For those owning & not owning an asset, widow shows the the lowest default risk rate within each credit category.
  • For those owning an asset, civil marriage within the Q1_Low category shows the highest default risk rate.

Continous-continuos columns¶

In [ ]:
- PREV_AMT_CREDIT - PREV_AMT_GOODS_PRICE

PREV_AMT_CREDIT - PREV_AMT_GOODS_PRICE¶

In [267]:
cont_cont_plot(merged_df,'PREV_AMT_CREDIT','PREV_AMT_GOODS_PRICE')
No description has been provided for this image
In [268]:
rep_corr, def_corr = calc_correlation(merged_df,'PREV_AMT_CREDIT', 'PREV_AMT_GOODS_PRICE')
--- Correlation for PREV_AMT_CREDIT vs. PREV_AMT_GOODS_PRICE ---
Repayers (T=0) Correlation: 0.8949
Defaulters (T=1) Correlation: 0.8801
  • PREV_AMT_CREDIT is strongly positively correlated to PREV_AMT_GOODS_PRICE.

PREV_AMT_CREDIT - AMT_APPLICATION¶

7.4 Correlation¶

In [266]:
# split data by target 
df0 = merged_df[merged_df['TARGET'] == 0]
df1 = merged_df[merged_df['TARGET'] == 1]

#compute corr
corr0 = df0.corr(numeric_only=True)
corr1 = df1.corr(numeric_only=True)

def get_top_corr(corr, n=10):
    # take upper triangle only (no duplicates)
    c = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    c = c.stack().sort_values(ascending=False)
    return c.head(n)

top10_target1 = get_top_corr(corr1, 10)
top10_target0 = get_top_corr(corr0, 10)

print("\nTOP 10 CORRELATIONS — Target = 1")
print(top10_target1)

print("\nTOP 10 CORRELATIONS — Target = 0")
print(top10_target0)
TOP 10 CORRELATIONS — Target = 1
OBS_30_CNT_SOCIAL_CIRCLE    OBS_60_CNT_SOCIAL_CIRCLE      1.00
AMT_CREDIT                  AMT_GOODS_PRICE               0.98
REGION_RATING_CLIENT        REGION_RATING_CLIENT_W_CITY   0.96
PREV_AMT_APPLICATION        PREV_AMT_CREDIT               0.94
                            PREV_AMT_GOODS_PRICE          0.91
CNT_CHILDREN                CNT_FAM_MEMBERS               0.88
PREV_AMT_CREDIT             PREV_AMT_GOODS_PRICE          0.88
DEF_30_CNT_SOCIAL_CIRCLE    DEF_60_CNT_SOCIAL_CIRCLE      0.87
REG_REGION_NOT_WORK_REGION  LIVE_REGION_NOT_WORK_REGION   0.85
PREV_AMT_ANNUITY            PREV_AMT_GOODS_PRICE          0.84
dtype: float64

TOP 10 CORRELATIONS — Target = 0
OBS_30_CNT_SOCIAL_CIRCLE    OBS_60_CNT_SOCIAL_CIRCLE      1.00
AMT_CREDIT                  AMT_GOODS_PRICE               0.99
REGION_RATING_CLIENT        REGION_RATING_CLIENT_W_CITY   0.95
PREV_AMT_APPLICATION        PREV_AMT_CREDIT               0.94
                            PREV_AMT_GOODS_PRICE          0.92
PREV_AMT_CREDIT             PREV_AMT_GOODS_PRICE          0.89
CNT_CHILDREN                CNT_FAM_MEMBERS               0.88
REG_REGION_NOT_WORK_REGION  LIVE_REGION_NOT_WORK_REGION   0.86
DEF_30_CNT_SOCIAL_CIRCLE    DEF_60_CNT_SOCIAL_CIRCLE      0.86
PREV_AMT_ANNUITY            PREV_AMT_GOODS_PRICE          0.85
dtype: float64

7.5 Conclusion¶

From the above analysis, we have gathered the below insights:

Part A: Strategic Overview & Research Insights

i) Research Questions

  • How do occupational stability and life-stage transitions (like maternity or unemployment) interact with educational
    backgrounds to determine default risk?

    • The research indicates that stable professions such as Accountants and Pensioners act as a hedge against risk across all income groups. Conversely, life transitions like maternity leave or unemployment create severe risk spikes when coupled with lower education levels (Secondary/Secondary Special), suggesting that educational attainment is a primary resilience factor during income volatility.
  • To what extent does "Civil Marriage" status combined with specific loan purposes (real estate/car repairs) signal financial distress compared to traditional marital statuses?

    • Findings suggest that "Civil Marriage" is a significant risk proxy, particularly when applicants are in the lowest income quartiles (Q1_Low) or attempting to fund specific lifestyle assets like holiday homes or car repairs. This identifies a behavioral trend where non-traditional marital structures may correlate with less conservative financial planning in specific credit categories.
  • Does asset ownership effectively mitigate the risk of a "Refused" or "Cancelled" previous contract history?

    • The data shows that for unemployed individuals or those in rented apartments, previous contract status remains a dominant predictor. Even when owning an asset, certain groups (unemployed in Q2/Q3 annuity categories) remain highly likely to default, proving that asset ownership is not a universal solution for poor credit history or employment instability.

ii) Interesting Insights

  • The "Widow" Reliability Factor: Among all female applicants, widows consistently demonstrate the lowest default risk rate within every credit category, regardless of asset ownership. This points to a highly disciplined financial demographic.
  • Office Apartment Advantage: Applicants residing in office apartments show a consistently lower likelihood of default across all goods price groups, serving as a stronger positive indicator than general home ownership.
  • The Commercial Associate Exception: While Academic Degree holders are generally the safest borrowers, those working as "Commercial Associates" deviate from this trend and show a higher likelihood of defaulting, highlighting role-specific volatility even among the highly educated.

Part B: Specific Findings¶

  1. Employment Status and Occupational Risk Profiles
  • This section details the default likelihood based on the applicant's current employment status, occupation type, and income category interactions.

    High Default Risk (Employment & Low Skill)

    • Maternity leave income type specifically with secondary/secondary special education are very likely to default.
    • Maternity leave laborers are more likely to default.
    • Low-skill laborers within Q1_Low income group are more likely to default.
    • Low skill laborers with NA nflag insured on approval are more likely to default.
    • Those unemployed within the Q2_MidLow credit group are more likely to default.
    • Those in Q2_MidLow credit group while being unemployed are more likely to default.
    • Those unemployed within Q3_MidHigh or Q4_High previous annuity group are more likely to default.
    • Those who are unemployed belonging to Q2_MidLow, maternity leave within Q3_MidHigh & Q4_High previous application are more likely to default.
    • Across all education types, those who are unemployed & not owning an asset while those who are on maternity leave & owning an asset are more likely to default.
    • For those owning an asset, unemployed income type within Q2_MidLow annuity category or Q3_MidHigh are more likely to default.

    Low Default Risk (Stable Occupations)

    • Pensioners are less likely to default.
    • Businessman & students are the least likely to default, indicating a very high likelihood of paying on time.
    • Pensioners are more likely to pay on time in almost every credit groups.
    • Accountants consistently show a higher likelihood of paying on time across all income groups.
    • State servent within Q4_High are less likely to default.
    • Pensioners are less likely to default within each previous application groups.
  1. Education Level and Financial Group Interactions
  • This section focuses on how education level influences default risk, particularly when combined with annuity groups.

    High Default Risk (Lower Education)

    • Those with lower secondary education wthin the Q3_MidHigh annuity group are more likely to default.
    • Those with lower secondary education regardless of whether they own an asset or have a certain number of children are more likely to default.
    • Those with lower secondary education belonging to Q3_MidLow or Q3_MidHigh annuity group are more likely to default.
    • Those with academic degree specifically who are commercial associates & working are likely to default while others being least likely.

    Low Default Risk (Higher Education)

    • Academic degree regardless of the annuity group consistently shows a high likelihood of paying on time.
    • Most of the people with academic degree & 2 or 3+ children are most likely to repay on time.
    • Those with academic degree either belonging to Q1_Low annuity group or Q3_MidHigh group are least likely to default.
  1. Demographics, Family Status, and Gender
  • This theme groups findings related to gender, family size, marital status, and related risk categories.

    High Default Risk (Demographics & Family Structure)

    • Regardless of the family status, males are more likely to default as compared to females.
    • Those with 3+ children are more likely to default while those with 0 children are less likely to default regardless of whether these two groups own a car or not.
    • Similarly, we found that those with 5+ family members are more likely to default while those with 2 members are less likely to default regardless of whether they own an asset or not.
    • Males with SCOFR code are more likely to default while females with VERIF less likely to default.
    • Civil marriage with buying a home land are more likely to default.
    • Civil marriage belonging to Q1_Low are more likely to default.
    • For those not owning an asset, civil marriage housing type within the buying a holiday home/land category are more likely to default.
    • For those not owning an asset, civil marriage category within Q1_Low shows the highest default rate.
    • For those owning an asset, separated civil marriage housing type within the car repairs category are more likely to default.
    • For those owning an asset, civil marriage within the Q1_Low category shows the highest default risk rate.

    Low Default Risk (Demographics)

    • Among all female applicants, those who are widows are less likely to default compared to every other family status, indicating a higher likelihood of paying on time.
    • For those owning & not owning an asset, widow shows the the lowest default risk rate within each credit category.
  1. Loan Characteristics and Housing/Collateral
  • This theme focuses on loan type (Cash vs. Revolving) and the applicant's housing situation.

    High Default Risk (Loan & Housing)

    • Regardless of the housing type, people with cash loans are more likely to default as compared to revolving loans.
    • Specifically, we found that cash loans for apartment has the highest likelihood of defaulting.
    • Those in rented apartment within Q2_MidLow goods price group are more likely to default.
    • Those who stay in a rented apartment & has their previous contract refused are more likely to default.
    • Among all the previous contract status, those in a rented apartment are more likely to default.

    Low Default Risk (Loan & Housing)

    • Revolving loans for municipal apartment are less likely to default.
    • Those in office apartment consistently show that they are less likely to default across all goods price group.
  1. External Scores and Multi-Factor Scenarios
  • This theme covers complex interactions involving external credit scores and highly specific combinations of employment, assets, and previous application history.

    • Those in Q1_Low external source 2 group are more likely to default while those in Q4_High are less likely to default regardless of whether they own an asset or not.
    • Maternity group with their previous contract status approved & staying in a house/apartment are more likely to default.
    • Those unemployed approved contract staying in municipal apartment are more likely to default.
    • Unemployed group staying in house/apartment with contract cancelled are more likely to default.

Credit Risk: Consolidated Action Policy Guide¶

  • This policy guide consolidates all identified default risk factors into mandatory actions (Decline, Review, or Approve) grouped by theme for simplified operational application.

  • Employment Status and Occupational Risk Profiles

    A. Mandatory Decline (High Default Risk)

    • Unemployment/Temporary Leave: Automatically reject all applications from unemployed individuals.
    • Specific Maternity Cases: Automatically reject applicants who are currently on maternity leave AND have Secondary or Secondary Special education.
    • Low-Skill Financial Risk: Automatically reject low-skill laborers who fall within the Q1_Low income group.

    B. Mandatory Review and Mitigation (Conditional Check)

    • Low-Skill Labor & Insurability: Require mandatory collateral or a financial guarantee for any laborer on maternity leave. Additionally, require mandatory insurance for low-skill laborers with an NA nflag.
    • Asset-Holding Unemployed: If an unemployed applicant does own an asset (specifically in Q2_MidLow annuity or Q3_MidHigh groups), the loan must be 100% secured with that asset as collateral, requiring a manual underwriter review.

    C. Instant Approval (Low Default Risk)

    • Stable Profiles: Fast-track and offer the most favorable interest rates and terms to all Pensioners, Businessmen, Students, and Accountants.
    • High-Tier Government Servants: Instantly approve State Servants who are in the Q4_High credit group.
  • Education Level and Financial Group Interactions

    A. Mandatory Decline (High Default Risk)

    • Lower Secondary Education: Automatically reject all applicants with Lower Secondary Education, especially if they are in the Q3_MidLow or ' Q3_MidHigh annuity groups, regardless of asset ownership or number of children.

    B. Mandatory Review and Mitigation (Conditional Check)

    • Commercial Academic Associates: Require a manual review for all Commercial Associates who have an Academic Degree and are currently working.

    C. Instant Approval (Low Default Risk)

    • Academic Degree Holders: Offer premium terms, fast-track processing, and favorable pricing to all applicants with an Academic Degree, especially those with 2 or more children.
  • Demographics, Family Status, and Gender

    A. Mandatory Decline (High Default Risk)

    • Civil Marriage & Financial Risk: Automatically reject applicants in a Civil Marriage if they are in the Q1_Low credit group (regardless of asset ownership) or if the loan is intended for real estate (buying a home land or holiday home).

    B. Mandatory Review and Mitigation (Conditional Check)

    • Gender and Risk: All male applicants must be subject to a higher Debt-to-Income (DTI) threshold compared to female applicants, especially those with the SCOFR code.
    • Family Size: Require a co-signer or mandatory manual review for applicants with 3 or more children, or 5 or more family members.
    • Separated Applicants: Require a co-signer for separated civil marriage applicants who own an asset and are applying for a car repairs loan.

    C. Instant Approval (Low Default Risk)

    • Widow Status: Fast-track and offer the best rates to all female applicants who are widows, as they consistently show the lowest default risk.
    • Family Size: Offer favorable terms and a rate discount to applicants with zero or two family members.
  • Loan Characteristics and Housing/Collateral

    A. Mandatory Decline (High Default Risk)

    • Cash Loan/Housing Specifics: Reject all Cash Loan applications intended for an apartment.
    • Rented Apartment Risk: Reject any applicant living in a Rented Apartment who has a previous contract refusal or is in the Q2_MidLow goods price group.

    B. Mandatory Review and Mitigation (Conditional Check)

    • Cash Loan Premium: All cash loan applications must carry a substantial risk premium or higher collateral requirement compared to revolving loans.

    C. Instant Approval (Low Default Risk)

    • Ideal Loan/Housing/Score Combinations: Instantly approve and offer favorable terms to applicants with a Q4_High External Source 2 score, those living in an Office Apartment, or those applying for a Revolving Loan for a Municipal Apartment.
  • External Scores and Multi-Factor Scenarios

    A. Mandatory Decline (High Default Risk)

    • Poor Housing/Score Combinations: Automatically reject all applications from applicants with a Q1_Low external source 2 score.
    • Specific Maternity/Housing Risk (NEW ACTION): Automatically reject applicants on maternity leave if they have a previous contract status approved and reside in a house or apartment.

    B. Mandatory Review and Mitigation (Conditional Check)

    • No new actions required; covered by existing Theme 1/4 rules.

    C. Instant Approval (Low Default Risk)

    • Ideal Loan/Housing/Score Combinations: Instantly approve and offer favorable terms to applicants with a Q4_High External Source 2 score or those covered in Theme 4.